The following three overarching themes have been fully integrated throughout the Pearson Edexcel
AS and A level Mathematics series, so they can be applied alongside your learning and practice. 

1. Mathematical argument, language and proof 

• Rigorous and consistent approach throughout

• Notation boxes explain key mathematical language and symbols

• Dedicated sections on mathematical proof explain key principles and strategies

• Opportunities to critique arguments and justify methods


2. Mathematical problem solving

• Hundreds of problem-solving questions, fully integrated into the main exercises

• Problem-solving boxes provide tips and strategies

• Structured and unstructured questions to build confidence

• Challenge boxes provide extra stretch

[Figure: The Mathematical Problem-solving cycle: specify the problem, collect information, process and represent information, interpret results]

3. Mathematical modelling 
• Dedicated modelling sections in relevant topics provide plenty of practice where you need it

• Examples and exercises include qualitative questions that allow you to interpret answers in the context of the model

• Dedicated chapter in Statistics & Mechanics Year 1/AS explains the principles of modelling in

Use of technology
Explore topics in more detail, visualise problems and consolidate your understanding using pre-made GeoGebra activities.

Online: Find the point of intersection graphically using technology.

GeoGebra-powered interactives

Interact with the maths you are learning using GeoGebra's easy-to-use tools.

Access all the extra online content for free at:

www.pearsonschools.co.uk/fs2maths

You can also access the extra online content by scanning this QR code:


After completing this chapter, you should be able to:
• Calculate the equation of a regression line using raw data or summary statistics
• Use coding to find the equation of a regression line
• Calculate residuals and use them to test for linear fit and identify outliers
• Calculate the residual sum of squares (RSS)


1 The table shows the time, t minutes, taken to make
p grams of a product during a chemical reaction.
t 1 3 4 I 
Do tay 89) 30 


The equation of the regression line of p on t is given as
p = 2.84 + 3.85t.

a Interpret the value 3.85 in this equation.

b Comment on the validity of using this regression line to estimate
i the amount of chemical produced after 5 minutes
ii the amount of chemical produced after 10 minutes
iii the time taken to produce 20 grams of product.
← SM1, Chapter 4

Sports scientists use regression models to predict physiological characteristics. This can help athletes optimise their performance.
→ Mixed exercise Q5


Chapter 1 


1.1 Least squares linear regression


When you are analysing bivariate data, you can use a least squares regression line to predict values of the dependent (response) variable for given values of the independent (explanatory) variable. If the response variable is y and the explanatory variable is x, you should use the regression line of y on x, which can be written in the form y = a + bx.

You should only use the regression line to make predictions for values of the dependent variable that are within the range of the given data. This is called interpolation. Making predictions for values outside of the range of the given data is called extrapolation and produces a less reliable prediction. ← SM1, Section 4.2


The least squares regression line is the line that minimises the sum of the squares of the residuals 
of each data point. 


■ The residual of a given data point is the difference between the observed value of the dependent variable and the predicted value of the dependent variable.

Notation: The Greek letter epsilon ($\varepsilon$) is sometimes used to denote a residual.

[Figure: scatter diagram of four data points with a regression line, showing the residual of each point as a vertical distance from the line]

$\varepsilon_1$ is the residual of the data point $(x_1, y_1)$.

The least squares regression line of y on x is the straight line that minimises the value of $\varepsilon_1^2 + \varepsilon_2^2 + \varepsilon_3^2 + \varepsilon_4^2$. In general, if each data point has residual $\varepsilon_i$, the regression line minimises the value of $\sum \varepsilon_i^2$.

The observed value of the dependent variable, $y_3$, is less than the predicted value, so the residual of $(x_3, y_3)$ will be negative.


You need to be able to find the equation of a least squares regression line using raw data or 
summary statistics. 


■ The equation of the regression line of y on x is:

y = a + bx

where $b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$

Watch out: You can calculate a and b directly from raw data using your calculator. However, you might be given summary statistics in the exam so you need to be familiar with these formulae.

$S_{xy}$ and $S_{xx}$ are known as summary statistics and you can calculate them using the following formulae:

$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n}$
$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$
$S_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n}$
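The summary statistic formulae translate directly into a few lines of code. The following is a minimal Python sketch, not a replacement for the calculator method. The function names `summary_statistics` and `regression_line` and the five data points are illustrative assumptions, not taken from the text.

```python
def summary_statistics(xs, ys):
    """Return (S_xx, S_yy, S_xy) for paired data, using the formulae above."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy


def regression_line(xs, ys):
    """Return (a, b) for the least squares regression line y = a + bx."""
    s_xx, _, s_xy = summary_statistics(xs, ys)
    b = s_xy / s_xx                                 # gradient: S_xy / S_xx
    a = sum(ys) / len(ys) - b * sum(xs) / len(xs)   # intercept: y-bar - b * x-bar
    return a, b


# Illustrative data only (hypothetical values, not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = regression_line(xs, ys)
print(f"y = {a:.3f} + {b:.3f}x")
```

For these values $S_{xx} = 10$ and $S_{xy} = 6$, so the gradient is 0.6 and the intercept is $4 - 0.6 \times 3 = 2.2$.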


Linear regression 


Example 


The results from an experiment in which different masses were placed on a spring and the resulting 
length of the spring measured, are shown below. 


Mass, x (kg) | 20 | 40 | 60 | 80 | 100
Length, y (cm) | 48 | 55.1 | 56.3 | 61.2 | 68

a Calculate $S_{xx}$ and $S_{xy}$.
(You may use $\sum x = 300$, $\sum x^2 = 22000$, $\bar{x} = 60$, $\sum xy = 18238$, $\sum y^2 = 16879.14$, $\sum y = 288.6$, $\bar{y} = 57.72$)
b Calculate the regression line of y on x.
c Use your equation to predict the length of the spring when the applied mass is:
i 58 kg
ii 130 kg

d Comment on the reliability of your predictions.

Online: Explore the calculation of a least squares regression line using GeoGebra.


a $S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 22000 - \dfrac{300^2}{5} = 4000$

$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 18238 - \dfrac{300 \times 288.6}{5} = 922$

b $b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{922}{4000} = 0.2305$

$a = \bar{y} - b\bar{x} = 57.72 - 0.2305 \times 60 = 43.89$

y = 43.89 + 0.2305x

c i y = 43.89 + 0.2305 × 58 = 57.3 cm (3 s.f.)
ii y = 43.89 + 0.2305 × 130 = 73.9 cm (3 s.f.)

d Assuming the model is reasonable, the prediction when the mass is 58 kg is reliable since this is within the range of the data.

The prediction when the mass is 130 kg is less reliable since this is outside the range of the data. This is called extrapolation.


Example 


A scientist working in agricultural research believes that there is a linear relationship between the 
amount of a certain food supplement given to hens and the hardness of the shells of the eggs they 
lay. As an experiment, controlled quantities of the supplement were added to the hens’ normal diet 
for a period of two weeks and the hardness of the shells of the eggs laid at the end of this period 
was then measured on a scale from 1 to 10, with the following results:

Food supplement, f (g/day) | 2 | 4 | 6 | 8 | 10 | 12 | 14
Hardness of shells, h | 3.2 | 5.2 | 5.5 | 6.4 | 7.2 | 8.5 | 9.8

a Find the equation of the regression line of h on f.
(You may use $\sum f = 56$, $\sum h = 45.8$, $\bar{f} = 8$, $\bar{h} = 6.543$, $\sum f^2 = 560$, $\sum fh = 422.6$)
b Interpret what the values of a and b tell you.


a $S_{fh} = \sum fh - \dfrac{\sum f \sum h}{n} = 422.6 - \dfrac{56 \times 45.8}{7} = 56.2$

$S_{ff} = \sum f^2 - \dfrac{(\sum f)^2}{n} = 560 - \dfrac{56^2}{7} = 112$

$b = \dfrac{S_{fh}}{S_{ff}} = \dfrac{56.2}{112} = 0.5017...$ hardness units per g per day

$a = \bar{h} - b\bar{f} = 6.543 - 0.5017... \times 8 = 2.5287...$ hardness units

h = 2.53 + 0.502f

Watch out: The variables given might not be x and y. Be careful that you use the correct values when you substitute into the formulae. It can sometimes help to write x next to the explanatory variable in the table (f) and y next to the response variable (h).


b a estimates the shell strength when no supplement is given (i.e. when f = 0). Zero is only just outside the range of f so it is reasonable to use this value.

b estimates the rate at which the hardness increases with increased food supplement; in this case for every extra one gram of food supplement per day the hardness increases by 0.502 (3 s.f.) hardness units.


Example 


A repair workshop finds it is having a problem with a pressure gauge it uses. It decides to have it 
checked by a specialist firm. The following data were obtained. 


Gauge reading, x (bars) | 1.0 | 1.4 | 1.8 | 2.2 | 2.6 | 3.0 | 3.4 | 3.8
Correct reading, y (bars) | 0.96 | 1.33 | 1.75 | 2.14 | 2.58 | 2.97 | 3.38 | 3.75
(You may use $\sum x = 19.2$, $\sum x^2 = 52.8$, $\sum y = 18.86$, $\sum y^2 = 51.30$, $\sum xy = 52.04$)

a Show that $S_{xy} = 6.776$ and find $S_{xx}$.
It is thought that a linear relationship of the form y = a + bx could be used to describe these data.

b Use linear regression to find the values of a and b, giving your answers to 3 significant figures.

c Draw a scatter diagram to represent these data and draw the regression line on your diagram.

d The gauge shows a reading of 2 bars. Using the regression equation, work out what the correct reading should be.


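a $S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 52.04 - \dfrac{19.2 \times 18.86}{8} = 6.776$

$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 52.8 - \dfrac{19.2^2}{8} = 6.72$

b $b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{6.776}{6.72} = 1.0083...$

$a = \bar{y} - b\bar{x} = \dfrac{18.86}{8} - 1.0083... \times \dfrac{19.2}{8} = -0.0625$

Regression line is: y = −0.0625 + 1.008x
or y = 1.008x − 0.0625

c [Scatter diagram: correct reading (bars) plotted against gauge reading (bars), with the regression line drawn]

d y = (1.008 × 2) − 0.0625
= 1.95 bars (3 s.f.)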


Exercise 


1 The equation of a regression line in the form y = a + bx is to be found. Given that $S_{xx} = 15$, $S_{xy} = 90$, $\bar{x} = 3$ and $\bar{y} = 15$, work out the values of a and b.

2 Given that $S_{xx} = 30$, $S_{xy} = 165$, $\bar{x} = 4$ and $\bar{y} = 8$, find the equation of the regression line of y on x.

3 The equation of a regression line is to be found. The following summary data is given:
$S_{xx} = 40$, $S_{xy} = 80$, $\bar{x} = 6$, $\bar{y} = 12$
Find the equation of the regression line in the form y = a + bx.

4 Data is collected and summarised as follows:
$\sum x = 10$, $\sum x^2 = 30$, $\sum y = 48$, $\sum xy = 140$, n = 4
a Work out $\bar{x}$, $\bar{y}$, $S_{xx}$ and $S_{xy}$.
b Find the equation of the regression line of y on x in the form y = a + bx.

5 For the data in the table,

5 8 10
3 7 8 13 17

Hint: Check your answer using the statistical functions on your calculator.

a calculate $S_{xx}$ and $S_{xy}$
b find the equation of the regression line of y on x in the form y = a + bx.

6 Research was done to see if there is a relationship between finger dexterity and the ability to do work on a production line. The data is shown in the table.

Dexterity score, x | 2.5 | 3 | 3.5 | 4 | 5 | 5 | 5.5 | 6.5 | 7 | 8
Productivity, y | 80 | 130 | 100 | 220 | 190 | 210 | 270 | 290 | 350 | 400

The equation of the regression line for these data is y = −59 + 57x.
a Use the equation to estimate the productivity of someone with a dexterity of 6.
b Give an interpretation of the value of 57 in the equation of the regression line.
c State, giving in each case a reason, whether or not it would be reasonable to use this equation to work out the productivity of someone with dexterity of:
i 2
ii 14

7 A field was divided into 12 plots of equal area. Each plot was fertilised with a different amount of fertilizer (h). The yield of grain (g) was measured for each plot. Find the equation of the regression line of g on h in the form g = a + bh given the following summary data.

$\sum h = 22.09$, $\sum g = 49.7$, $\sum h^2 = 45.04$, $\sum g^2 = 244.83$, $\sum hg = 97.778$, n = 12


8 Research was done to see if there was a relationship between the number of hours in the working week (w) and productivity (p). The data are shown in the two scatter graphs below.

[Two scatter graphs of the same data: p plotted against w, and w plotted against p]

(You may use $\sum p = 397$, $\sum p^2 = 16643$, $\sum w = 186$, $\sum w^2 = 3886$, $\sum pw = 6797$)
a Calculate the equation of the regression line of p on w, giving your answer in the form p = a + bw.
b Rearrange this equation into the form w = c + dp.
The equation of the regression line of w on p is w = 45.0 − 0.666p.
c Comment on the fact that your answer to part b is different to this equation.
d Which equation should you use to predict:
i the productivity for a 23-hour working week
ii the number of hours in a working week that achieves a productivity score of 40.


9 In a chemistry experiment, the mass of chemical produced, y, and the temperature, x, are recorded.

x (°C) | 100 | 110 | 120 | 130 | 140 | 150 | 160 | 170 | 180 | 190 | 200
y (mg) | 34 | 39 | 41 | 45 | 48 | 47 | 41 | 35 | 26 | 15 | 3

Maya thinks that the data can be modelled using a linear regression line.

a Calculate the equation of the regression line of y on x. Give your answer in the form y = a + bx.
b Draw a scatter graph for these data.
c Comment on the validity of Maya's model.


10 An accountant monitors the number of items produced per month by a company (n) together 
with the total production costs (p). The table shows these data. 


Number of items, n (1000s) | 21 | 39 | 48 | 24 | 72 | 75 | 15 | 35 | 62 | 81 | 12 | 56
Production costs, p (£1000s) | 40 | 58 | 67 | 45 | 89 | 96 | 37 | 53 | 83 | 102 | 35 | 75

(You may use $\sum n = 540$, $\sum n^2 = 30786$, $\sum p = 780$, $\sum p^2 = 56936$, $\sum np = 41444$)

Watch out: The numbers of items are given in 1000s. Be careful to choose the correct value to substitute into your regression equation.

a Calculate $S_{np}$ and $S_{nn}$. (2 marks)
b Find the equation of the regression line of p on n in the form p = a + bn. (3 marks)
c Use your equation to estimate the production costs of 40000 items. (2 marks)
d Comment on the reliability of your estimate. (1 mark)


11 A printing company produces leaflets for different advertisers. The number of leaflets, n, measured in 100s and printing costs £p are recorded for a random sample of 10 advertisers. The table shows these data.

n (100s) | 1 | 3 | 4 | 6 | 8 | 12 | 15 | 18 | 20 | 25
p (pounds) | 22.5 | 27.5 | 30 | 35 | 40 | 50 | 57.5 | 65 | 70 | 82.5

(You may use $\sum n = 112$, $\sum n^2 = 1844$, $\sum p = 480$, $\sum p^2 = 26725$, $\sum np = 6850$)

a Calculate $S_{np}$ and $S_{nn}$. (2 marks)
b Find the equation of the regression line of p on n in the form p = a + bn. (3 marks)
c Give an interpretation of the value of b. (1 mark)

An advertiser is planning to print t hundred leaflets. A rival printing company charges 5p per leaflet.
d Find the range of values of t for which the first printing company is cheaper than the rival. (2 marks)

12 The relationship between the number of coats of paint applied to a boat and the resulting weather resistance was tested in a laboratory. The data collected are shown in the table.

Coats of paint, x | 1 | 2 | 3 | 4 | 5
Protection, y (years) | 1.4 | 2.9 | 4.1 | 5.8 | 7.2

a Use your calculator to find an equation of the regression line of y on x as a model for these results, giving your answer in the form y = a + bx. (2 marks)
b Interpret the value b in your model. (1 mark)
c Explain why this model would not be suitable for predicting the number of coats of paint that had been applied to a boat that had remained weather resistant for 7 years. (1 mark)
d Use your answer to part a to predict the number of years of protection when 7 coats of paint are applied. (2 marks)

In order to improve the reliability of its results, the laboratory made two further observations:

Coats of paint, x | 6 | 8
Protection, y (years) | 8.2 | 9.9

e Using all 7 data points:
i produce a refined model
ii use your new model to predict the number of years of protection when 7 coats of paint are applied
iii give two reasons why your new prediction might be more accurate than your original prediction. (5 marks)


Sometimes the original data is coded to make it easier to manage. You can calculate the equation of 
the original regression line from the coded one by substituting the coding formula into the equation 
of the coded regression line. 




Example 


Eight samples of carbon steel were produced with different percentages, c%, of carbon in them.

Each sample was heated in a furnace until it melted and the temperature, m in °C, at which it melted was recorded.

The results were coded such that x = 10c and $y = \dfrac{m - 700}{5}$.

The coded results are shown in the table.
Percentage of carbon, x 1 2 3 4 5 6
Melting point, y 35 28 24 16 15 12

a Calculate $S_{xy}$ and $S_{xx}$.
(You may use $\sum x^2 = 204$ and $\sum xy = 478$.)
b Find the regression line of y on x.

c Estimate the melting point of carbon steel which contains 0.25% carbon.


a $S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 478 - \dfrac{36 \times 144}{8} = -170$

$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 204 - \dfrac{36^2}{8} = 42$

b $b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{-170}{42} = -4.047...$

$a = \bar{y} - b\bar{x} = \dfrac{144}{8} + 4.047... \times \dfrac{36}{8} = 36.214...$

y = 36.2 − 4.05x

Watch out: y and x are coded values. You can either code the given value of c, then reverse the coding for the resulting value of y (method 1). Or you can convert your regression equation in y and x to an equation in m and c (method 2).

c Method 1
If c = 0.25, then x = 10 × 0.25 = 2.5
y = 36.214... − 4.047... × 2.5 = 26.095...
$y = \dfrac{m - 700}{5}$, so m = 5y + 700
= 5 × 26.095... + 700 = 830 (3 s.f.)

Method 2
y = 36.214... − (4.047...)x
$\dfrac{m - 700}{5}$ = 36.214... − 4.047... × 10c
m − 700 = 181.07... − (202.38...)c
m = 881.07... − (202.38...)c
= 881.07... − (202.38...) × 0.25
= 830 (3 s.f.)
The estimate for the melting point is 830 °C (3 s.f.).
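The second method, converting a coded regression line back into the original variables, can be sketched in Python. This is a hedged illustration: the helper `decode_line` and its parameter names p, q, r, s are assumptions introduced here. It assumes both codings are linear, x = p·c + q and y = r·m + s, so the coding x = 10c, y = (m − 700)/5 corresponds to p = 10, q = 0, r = 1/5, s = −140.

```python
def decode_line(a, b, p, q, r, s):
    """Convert a coded line y = a + b*x into a line in the original variables.

    Assumes linear codings x = p*c + q and y = r*m + s (hypothetical names).
    Substituting gives r*m + s = a + b*(p*c + q), which rearranges to
    m = (a + b*q - s)/r + (b*p/r)*c.
    Returns (intercept, gradient) of the line of m on c.
    """
    intercept = (a + b * q - s) / r
    gradient = b * p / r
    return intercept, gradient


# Coded regression line from the carbon steel example: b = S_xy/S_xx, a = y-bar - b*x-bar
b = -170 / 42
a = 144 / 8 - b * 36 / 8

intercept, gradient = decode_line(a, b, p=10, q=0, r=1 / 5, s=-140)
estimate = intercept + gradient * 0.25   # melting point when c = 0.25
print(f"m = {intercept:.2f} + ({gradient:.2f})c, estimate = {estimate:.0f}")
```

Working with the unrounded values of a and b, as the worked solution does, avoids carrying rounding error into the decoded equation.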
Exercise


1 Given that the coding p = x + 2 and q = y − 3 has been used to get the regression equation
p + q = 5, find the equation of the regression line of y on x in the form y = a + bx.


2 Given the coding x = p — 10 and y = s — 100 and the regression equation x = y + 2, work out the 
equation of the regression line of s on p. 


3 Given that the coding g = S and h = rn — 2 has been used to get the regression equation 


h= 6 — 4g, find the equation of the regression line of y on x. 


4 The regression line of t on s is found by using the coding x = s − 5 and y = t − 10.
The regression equation of y on x is y = 14 + 3x.
Work out the regression line of t on s.


5 A regression line of c on dis worked out using the coding x = 5 and y= a 


a Given that $S_{xy} = 120$, $S_{xx} = 240$, the mean of x ($\bar{x}$) is 5 and the mean of y ($\bar{y}$) is 6, calculate the regression line of y on x.

b Find the regression line of d on c.


6 Some data on the coverage area, a m², and cost, £c, of five boxes of flooring were collected.


The results were coded such that x = a8 and y = . x 5 | 10 | 16 | 17 
The coded results are shown in the table. y 9 12 | 16 | 21 | 23 
a Calculate $S_{xy}$ and $S_{xx}$ and use them to find the equation of the regression line of y on x. (4 marks)
b Find the equation of the regression line of c on a. (2 marks)
c Estimate the cost of a box of flooring which covers an area of 32 m². (2 marks)


7 A farmer collected data on the annual rainfall, xcm, and the annual yield of potatoes, p tonnes 
per acre. 


The data for annual rainfall was coded using v = x and the following statistics were found: 
S= 1021 Spy = 15.26 Spp = 23.39 Pp = 9.88 vy =4.58 
a Find the equation of the regression line of p on v in the form p = a + bv. (3 marks)


b Using your regression line, estimate the annual yield of potatoes per acre when the annual 
rainfall is 42cm. (2 marks) 


1.2 Residuals


You can use residuals to check the reasonableness of a linear fit and to find possible outliers. 


■ If a set of bivariate data has regression equation y = a + bx, then the residual of the data point $(x_i, y_i)$ is given by $y_i - (a + bx_i)$. The sum of the residuals of all data points is 0.


Consider the following data set:

x | 1 | 2 | 4 | 6 | 7
y | 1.2 | 1.7 | 3.1 | 5.2 | 5.8

The equation of the regression line of y on x is y = 0.2 + 0.8x.

You can calculate the residuals for each data point and record them in a table:

x | y | y = 0.2 + 0.8x | ε
1 | 1.2 | 1.0 | 0.2
2 | 1.7 | 1.8 | −0.1
4 | 3.1 | 3.4 | −0.3
6 | 5.2 | 5.0 | 0.2
7 | 5.8 | 5.8 | 0

The residuals can be plotted on a residual plot to show the trend:

[Figure: residual plot of ε against x for the five data points]

The distribution of the residuals around zero is a good indicator of linear fit. You would expect the residuals to be randomly scattered about zero. If you see a trend in the residuals, you would question the appropriateness of the linear model.

Non-random residuals might follow an increasing pattern, a decreasing pattern or an obviously curved pattern. Here are three examples of residual patterns which might indicate that a linear model is not suitable:

[Figure: three residual plots showing an increasing pattern, a decreasing pattern and a curved pattern]
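The residual calculation can be checked with a short Python sketch. The function name `residuals` is an assumption for illustration. The data are the five points whose y-values are 1.2, 1.7, 3.1, 5.2 and 5.8, with the regression line y = 0.2 + 0.8x.

```python
def residuals(xs, ys, a, b):
    """Residual of each point: observed y minus predicted y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


xs = [1, 2, 4, 6, 7]
ys = [1.2, 1.7, 3.1, 5.2, 5.8]
eps = residuals(xs, ys, a=0.2, b=0.8)

for x, e in zip(xs, eps):
    print(f"x = {x}: residual = {e:+.2f}")

# For a least squares line the residuals sum to zero (up to rounding error).
print(f"sum of residuals = {sum(eps):.4f}")
```

Scanning the signs of the printed residuals for a run of positives followed by a run of negatives (or a curve) is the numerical equivalent of inspecting a residual plot.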


Example


The table shows the relationship between the temperature, t °C, and the sales of ice cream, s, on five days in June:

Temp, t (°C) | 15 | 16 | 18 | 19 | 21
Sales, s (100s) | 12.0 | 15.0 | 17.5 | p | 24.0

Online: Explore residuals of data points and reasonableness of fit using GeoGebra.

The equation of the regression line of s on t is given as s = −17.154 + 1.9693t.

a Calculate the residuals for the given regression line and hence find the value of p.

b By considering the residuals, comment on whether a linear regression model is suitable for these data.


a
t | s | s = −17.154 + 1.9693t | ε
15 | 12.0 | 12.3855 | −0.3855
16 | 15.0 | 14.3548 | 0.6452
18 | 17.5 | 18.2934 | −0.7934
19 | p | 20.2627 | p − 20.2627
21 | 24.0 | 24.2013 | −0.2013

−0.3855 + 0.6452 − 0.7934 + (p − 20.2627) − 0.2013 = 0
⇒ p = 21.0 (3 s.f.)
The residual for t = 19 is 0.7373.

b The residuals appear to be randomly distributed around zero, therefore it is likely that the linear regression model is suitable.


You can use residuals to identify possible outliers. 


Example

The table shows the time taken, t minutes, to produce y litres of paint in a factory.

t | 2.1 | 3.7 | 4.8 | 6.1 | 7.2
y | 19.2 | 27.3 | 26.9 | 38.5 | 40.9

The regression line of y on t is given as y = 9.7603 + 4.3514t.

One of the y-values was incorrectly recorded.

a Calculate the residuals and write down the outlier.
b Comment on the validity of ignoring this outlier in your analysis.
c Ignoring the outlier, produce a new model.
d Use the new model to estimate the amount of paint that is produced in 4.8 minutes.

a
t | y | y = 9.7603 + 4.3514t | ε
2.1 | 19.2 | 18.8982 | 0.3018
3.7 | 27.3 | 25.8605 | 1.4395
4.8 | 26.9 | 30.6470 | −3.7470
6.1 | 38.5 | 36.3038 | 2.1962
7.2 | 40.9 | 41.0904 | −0.1904

The outlier is (4.8, 26.9).

b The residuals suggest that this data point does not follow the pattern of the rest of the data, so it is valid to remove it.

Problem-solving: You could also say that the data point is a valid piece of data so it should be used, or that there are only five data points so you should retain them all. You can make any reasonable conclusion as long as you give a reason.

c New model: y = 10.669 + 4.3573t
d y = 10.669 + 4.3573 × 4.8 = 31.6 litres (3 s.f.)
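The outlier steps in this example can be reproduced with a short Python sketch. It assumes a simple rule, flagging the point with the largest absolute residual, which is an illustrative choice rather than a formal test from the text. The names `fit_line` and `residuals` are hypothetical helpers.

```python
def fit_line(xs, ys):
    """Least squares line y = a + b*x via S_xy / S_xx; returns (a, b)."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = s_xy / s_xx
    a = sum(ys) / n - b * sum(xs) / n
    return a, b


def residuals(xs, ys, a, b):
    """Observed value minus predicted value for each data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


ts = [2.1, 3.7, 4.8, 6.1, 7.2]
ys = [19.2, 27.3, 26.9, 38.5, 40.9]

# Residuals for the given line y = 9.7603 + 4.3514t
eps = residuals(ts, ys, a=9.7603, b=4.3514)

# Assumed rule: treat the point with the largest |residual| as the outlier.
worst = max(range(len(eps)), key=lambda i: abs(eps[i]))
print("outlier:", (ts[worst], ys[worst]))

# Refit without the outlier and estimate y when t = 4.8
kept_t = [t for i, t in enumerate(ts) if i != worst]
kept_y = [y for i, y in enumerate(ys) if i != worst]
a_new, b_new = fit_line(kept_t, kept_y)
estimate = a_new + b_new * 4.8
print(f"new model: y = {a_new:.3f} + {b_new:.4f}t, estimate = {estimate:.1f}")
```

Whether removing the point is justified remains a judgement that needs a stated reason; the code only identifies the candidate.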


It is often useful to have a numerical value to indicate how closely a given set of data fits a linear 
regression model. Because the sum of the residuals is 0, you find the square of each residual and 
work out the sum of these values. This is called the residual sum of squares (RSS). 


■ You can calculate the residual sum of squares (RSS) for a linear regression model using the formula

$\text{RSS} = S_{yy} - \dfrac{(S_{xy})^2}{S_{xx}}$

Note: The formula is given in the formulae booklet. You are not expected to be able to derive it.

In your A-level course, you used the product moment correlation coefficient, r, to measure the strength and type of linear correlation. ← SM2, Section 1.2

The linear regression model is the linear model which minimises the RSS for a given set of data. Unlike the product moment correlation coefficient, which takes values between −1 and 1, the units of the RSS are the same as the units of the response variable squared. For this reason, you should only use the RSS to compare goodness of fit for data recorded in the same units.


Example

The data shows the sales, in 100s, y, of Slush at a riverside café and the number of hours of sunshine, x, on five random days during August.

x | 8 | 10 | 11.5 | 12 | 12.2
y | 7.1 | 8.2 | 8.9 | 9.2 | 9.5

Given that $\sum x = 53.7$, $\sum y = 42.9$, $\sum x^2 = 589.09$, $\sum y^2 = 371.75$, $\sum xy = 467.45$
a calculate the residual sum of squares (RSS).
The RSS for five random days in December is 0.0562.

b State, with a reason, which month is more likely to have a linear fit between the number of hours of sunshine and the sales of Slush.

The RSS is also linked to the product moment correlation coefficient, r, by the equation $\text{RSS} = S_{yy}(1 - r^2)$. → Section 2.1


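a $S_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n} = 371.75 - \dfrac{42.9^2}{5} = 3.668$

$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 589.09 - \dfrac{53.7^2}{5} = 12.352$

$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 467.45 - \dfrac{53.7 \times 42.9}{5} = 6.704$

$\text{RSS} = S_{yy} - \dfrac{(S_{xy})^2}{S_{xx}} = 3.668 - \dfrac{6.704^2}{12.352} = 0.0294$ (3 s.f.)

b 0.0294 < 0.0562 therefore August is more likely to have a linear fit.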
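As a cross-check on the formula, the RSS can be computed in two ways: from the summary statistics, and directly as the sum of the squared residuals about the fitted line. The Python sketch below does both for the August data. The function names are assumptions for illustration.

```python
def summary_statistics(xs, ys):
    """Return (S_xx, S_yy, S_xy) for paired data."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xx, s_yy, s_xy


def rss_from_summary(xs, ys):
    """RSS = S_yy - (S_xy)^2 / S_xx."""
    s_xx, s_yy, s_xy = summary_statistics(xs, ys)
    return s_yy - s_xy ** 2 / s_xx


def rss_from_residuals(xs, ys):
    """Fit y = a + b*x by least squares, then sum the squared residuals."""
    s_xx, _, s_xy = summary_statistics(xs, ys)
    b = s_xy / s_xx
    a = sum(ys) / len(ys) - b * sum(xs) / len(xs)
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


xs = [8, 10, 11.5, 12, 12.2]     # hours of sunshine
ys = [7.1, 8.2, 8.9, 9.2, 9.5]   # sales of Slush (100s)
print(f"RSS (summary statistics) = {rss_from_summary(xs, ys):.4f}")
print(f"RSS (squared residuals)  = {rss_from_residuals(xs, ys):.4f}")
```

The two values agree up to floating-point rounding, which is why the summary statistic formula can be used without calculating each residual.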
Exercise


1 The table shows the relationship between two variables, x and y: 


x | 1.1 | 1.3 | 1.4 | 1.7 | 1.9
y | 12.2 | 14.5 | 16.9 | p | 23.5

The equation of the regression line of y on x is given as y = −3.633 + 14.33x.
Calculate the residuals for the given regression line and hence find the value of p.


2 The table shows the masses of six baby elephants, m kg, against the number of days premature they were born, x.

x | 2 | 5 | 8 | 9 | 11 | 15
m | 110 | 105 | 103 | 101 | 96 | 88

The equation of the regression line of m on x is given as m = 114.3 − 1.655x.
a Calculate the residual values. (2 marks)
b Draw a residual plot for this data. (2 marks)
Hint: There is an example of a residual plot on page 11.
c With reference to your residual plot, comment on the suitability of a linear model for this data. (1 mark)


3 Sarah completes a crossword each day. She measures both the time taken, ¢ minutes, and the 
accuracy of her answers, given as a percentage, p. She records this data for 10 days and the 
results are shown in the table below: 


t | 5.1 | 5.7 | 6.3 | 6.4 | 7.1 | 7.2 | 8.0 | 8.3 | 8.7 | 9.1
p | 79 | 81 | 85 | 86 | 89 | 84 | 95 | 96 | 98 | 99

The regression line of p on t is given as p = 51.04 + 5.308t.
a Calculate the residuals and use your results to identify an outlier. (3 marks) 
b State, with a reason, whether this outlier should be included in the data. (1 mark) 
c Ignoring the outlier, produce another model. (2 marks) 


d Use this model to predict the percentage of correct answers if the crossword takes 
Sarah 7.8 minutes to complete. (1 mark) 


4 The table shows the age, x years, of a particular model of car and the value, y, in £1000s. 


x 1.2 1.7 2.4 3.1 3.8 4.2 5.1 
y 13.1 12.5 10.9 9.4 7.9 a 5.8 


The regression line of y on x is given as y = 15.7 — 2.02x. 


a Calculate the residuals and hence find the value of a, correct to three significant figures. (3 marks)
b By considering the signs of the residuals, explain whether or not the linear regression model 
is suitable for this data. (1 mark) 
5 The table shows the ages of runners, a, against the times taken to complete an obstacle course, t minutes.

a | 17 | 19 | 22 | 25 | 30 | 38 | 41 | 44
t | 18 | 19 | 21 | 22 | 25 | 28 | 29 | 31

$\sum a = 236$, $\sum t = 193$, $\sum a^2 = 7720$, $\sum t^2 = 4821$, $\sum at = 6046$

a State what is measured by the residual sum of squares. (1 mark) 
b Calculate the residual sum of squares (RSS). (5 marks) 
The runners then complete a cross-country course. The RSS for the new set of data is 1.154. 


c State, with a reason, which data is more likely to have a linear fit. (1 mark) 
6 The table shows the amount of rainfall, d mm, against the relative humidity, h%, for Stratford-upon-Avon on 7 random days during September.

h | 67 | 69 | 74 | 77 | 79 | 81 | 87
d | 1.3 | 1.7 | 1.9 | 2.0 | 2.2 | 2.4 | 3.1

Given that $S_{hh} = 289.4$, $S_{dd} = 1.949$, $S_{hd} = 23.13$


a calculate the residual sum of squares (RSS). (2 marks) 
The RSS for a random sample of 7 days in October is 0.0965. 
b State, with a reason, which sample is more likely to have a linear fit. (1 mark) 


7 A particular model of car depreciates in value as it gets older. The table below shows the ages, x years, and the values, y £1000s, of a random sample of these cars.

x | 0.7 | 1.3 | 1.8 | 2.3 | 2.9 | 3.8
y | 15.4 | 13.5 | 12.1 | 10.1 | 8.5 | 5.8

$\sum x = 12.8$, $\sum y = 65.4$, $\sum x^2 = 33.56$, $\sum y^2 = 773.72$, $\sum xy = 120.03$
a Calculate the equation of the regression line of y on x, giving your answer in the form y = a + bx. Give the values of a and b correct to 4 significant figures. (3 marks)
b Give an interpretation of the value of a. (1 mark) 
c Use your regression line to estimate the value of a car that is 2 years old. (1 mark) 
d Calculate the values of the residuals. (2 marks) 
e Use your answer to part d to explain whether a linear model is suitable for these data. (1 mark) 


f Calculate the residual sum of squares (RSS). (2 marks) 
A sample for a second model of car has an RSS of 0.2548. 
g State, with a reason, which sample is more likely to have a linear fit. (1 mark) 


Challenge 


The table shows the relationship between two variables, x and y. 


aw | i | & | fF 
» | 2 | 2 | G@ 


Given that the equation of the regression line of y on x is y = 2 + 4x,
find the values of p and q.
Two variables s and t are thought to be connected by an equation of the form t = a + bs, where a and b are constants.

a Use the summary data

$\sum s = 553$, $\sum t = 549$, $\sum st = 31185$, n = 12, $\bar{s} = 46.0833$, $\bar{t} = 45.75$, $S_{ss} = 6193$

to work out the regression line of t on s. (3 marks)
b Find the value of t when s is 50. (1 mark)


A biologist recorded the breadth (x cm) and the length (y cm) of 12 beech leaves.
The data collected can be summarised as follows.

$\sum x^2 = 97.73$, $\sum x = 33.1$, $\sum y = 66.8$, $\sum xy = 195.94$
a Calculate $S_{xx}$ and $S_{xy}$. (2 marks)
b Find the equation of the regression line of y on x in the form y = a + bx. (3 marks)
c Predict the length of a beech leaf that has a breadth of 3.0 cm. (1 mark)


Energy consumption is claimed to be a good predictor of Gross National Product. An economist 
recorded the energy consumption (x) and the Gross National Product (y) for eight countries. 
The data are shown in the table. 


Energy consumption, x 3.4 7.7 12.0 75 58 67 113 131 
Gross National Product, y 55 240 390 1100 1390 1330 1400 1900 
a Calculate $S_{xy}$ and $S_{xx}$. (2 marks)
b Find the equation of the regression line of y on x in the form y =a + bx. (3 marks) 
c Estimate the Gross National Product of a country that has an energy consumption 
of 100. (1 mark) 
d Estimate the energy consumption of a country that has a Gross National Product 
of 3500. (1 mark) 
e Comment on the reliability of your answer to d. (1 mark) 


In an environmental survey on the survival of mammals, the tail length t(cm) and body length 
m(cm) of a random sample of six small mammals of the same species were measured. 


These data are coded such that x = a and y=t-2. 
The data from the coded records are summarised below. 
$\sum y = 13.5$, $\sum x = 25.5$, $\sum xy = 84.25$, $S_{xx} = 59.88$


a Find the equation of the regression line of y on x in the form y = ax + b. (3 marks) 
b Hence find the equation of the regression line of t on m. (2 marks) 
c Predict the tail length of a mammal that has a body length of 10cm. (2 marks) 


A sports scientist recorded the number of breaths per minute (r) and the pulse rate per minute
(p) for 10 athletes at different levels of physical exertion. The data are shown in the table.


= — 50 
The data are coded such that x = a and y = — 


2 

x | 3 | 5 | 5 | 7 | 8 | 9 | 9 | 10 | 12 | 13
y | 4 | 9 | 10 | 11 | 17 | 15 | 17 | 19 | 22 | 27
(You may use $\sum x = 81$, $\sum x^2 = 747$, $\sum y = 151$, $\sum y^2 = 2695$, $\sum xy = 1413$)
a Calculate $S_{xy}$ and $S_{xx}$. (2 marks)
b Find the equation of the regression line of y on x in the form y =a + bx. (3 marks) 
c Find the equation of the regression line for p on r. (2 marks) 
d Estimate the number of pulse beats per minute for someone who is taking 22 breaths per 

minute. (2 marks) 
e Comment on the reliability of your answer to d. (1 mark) 


A farm food supplier monitors the number of hens kept (x) against the weekly consumption
of hen food (y kg) for a sample of 10 small holders. He records the data and works out the
regression line for y on x to be y = 0.16 + 0.79x. 


a Write down a practical interpretation of the figure 0.79. (1 mark) 
b Estimate the amount of food that is likely to be needed by a small holder who has 

30 hens. (2 marks) 
c If food costs £12 for a 10 kg bag, estimate the weekly cost of feeding 50 hens. (2 marks) 


Water voles are becoming very rare. A naturalist society decided to record details of the water 
voles in their area. The members measured the mass (y) to the nearest 10 grams, and the body 
length (x) to the nearest millimetre, of eight active healthy water voles. The data they collected 
are in the table. 


Body length, x (mm) 140 150 170 180 180 200 220 220 

Mass, y (grams) 150 180 190 220 240 290 300 310 
a Draw a scatter diagram of these data. (2 marks) 
b Give a reason to support the calculation of a regression line for these data. (1 mark) 
c Use the coding /= Ta and w = a to work out the regression line of w on /. (3 marks) 
d Find the equation of the regression line for y on x. (2 marks) 
e Draw the regression line on the scatter diagram. (1 mark) 
f Use your regression line to calculate an estimate for the mass of a water vole that has a body 


length of 210mm. Write down, with a reason, whether or not this is a reliable estimate. (2 marks) 
The members of the society remove any water voles that seem unhealthy from the river and take 
them into care until they are fit to be returned. 
They find three water voles on one stretch of river which have the following measurements. 
A: Mass 235 g and body length 180 mm 
B: Mass 180 g and body length 200 mm 
C: Mass 195 g and body length 220 mm 
g Write down, with a reason, which of these water voles were removed from the river. (1 mark) 
A mail order company pays for postage of its goods partly by destination and partly by total
weight sent out on a particular day. The number of items sent out and the total weights were 
recorded over a seven-day period. The data are shown in the table. 


Number of items, n 10 13 22 15 24 16 19
Weight, w (kg) 2800 3600 6000 3600 5200 4400 5200 
a Use the coding x =n-10 and y= 06 to work out S,,, and S,... (4 marks) 
b Work out the equation of the regression line for y on x. (3 marks) 
c Work out the equation of the regression line for w on n. (2 marks)
d Use your regression equation to estimate the weight of 20 items. (2 marks) 
e State why it would be unwise to use the regression equation to estimate the weight of 
100 items. (1 mark) 
f Use your equation of the regression line found in part b to work out the residuals of the 
coded data points (x, y). (2 marks) 
g Use your equation of the regression line found in part c to work out the residuals of the
original data points (n, w). (2 marks) 
h Explain how your answers to parts f and g are related to the coding used. (1 mark) 


The table shows the time, t hours, against the temperature, T °C, of a chemical reaction.


t 2 3 5 6 7 9 10 
T 72 68 59 54 50 42 38 


Given that the equation of the regression line of T on t is $T = 80.445 - 4.289t$,


a calculate the residual values. (2 marks) 
b State, with a reason, whether a linear model is suitable in this case. (1 mark) 
Given that $S_{tt} = 52$, $S_{TT} = 957.43$ and $S_{tT} = -223$,
c calculate the residual sum of squares (RSS). (2 marks) 
A second chemical reaction has a RSS of 0.8754. 
d State, with a reason, which reaction is most likely to have a linear fit. (1 mark) 


A meteorologist is developing a model to describe the relationship between the number of 
hours of sunshine, s, and the daily rainfall, f mm, in summer.


A random sample of the number of hours sunshine and the daily rainfall is taken from 8 days 
and are summarised below: 


$\sum s = 53.4 \quad \sum s^2 = 395.76 \quad \sum f = 29.9 \quad \sum f^2 = 131.93 \quad \sum sf = 171.66$

a Calculate $S_{ss}$ and $S_{sf}$. (2 marks)
b Find the equation of the regression line of f on s. (3 marks)
c Use your equation to estimate the daily rainfall when there is 7.5 hours of sunshine. (1 mark) 


d Calculate the residual sum of squares (RSS). (3 marks) 
The table shows the residual for each value of s. 

s         3.1     4.2     5.4    6.2    7.1  8.8     9.1     9.5
Residual  -0.177  -0.196  0.256  0.124  x    -0.129  -0.216  -0.032


e Find the value of x. (2 marks)


f By considering the signs of the residuals, explain whether or not the linear regression 
model is suitable for these data. (1 mark) 


A random sample of 9 baby southern hairy-nosed wombats was taken. The age, x, in days, and 
the mass, y grams, was recorded. The results were as follows: 


x  1  2  3  4  5  6   7   8   9
y  4  5  7  8  9  11  12  11  15

(You may use $S_{xx} = 60 \quad S_{yy} = 98.89 \quad S_{xy} = 75$)

a Find the equation of the regression line of y on x in the form y = a + bx as a model for 


these results. Give the values of a and b correct to three significant figures. (2 marks)
b Show that the residual sum of squares is 5.14 to three significant figures. (2 marks) 
c Calculate the residual values. (2 marks) 
d Write down the outlier. (1 mark) 
e i Comment on the validity of ignoring this outlier. 


ii Ignoring the outlier, produce another model. 
iii Use this model to estimate the mass of a baby wombat after 20 days. 
iv Comment, giving a reason, on the reliability of your estimate. (5 marks) 


The annual turnover, £t million, of eight randomly selected UK companies, and the number of
staff employed in 100s, s, are recorded and the data are shown in the table below:

t, £million  1.2  1.5  1.8  2.1  2.5  2.7  2.8  3.1
s, 100s      1.1  1.4  1.7  2.2  2.4  2.6  2.9  3.2

($\sum t = 17.7 \quad \sum s = 17.5 \quad \sum t^2 = 42.33 \quad \sum s^2 = 42.07 \quad \sum ts = 42.16$)
a Calculate the equation of the regression line of s on t, giving your answer in the form
s = a + bt. Give the values of a and b correct to three significant figures. (3 marks)
b Use your regression line to predict the number of employees in a UK company with an
annual turnover of £2 300 000. (2 marks)
The table shows the residuals for each value of t:
t         1.2     1.5      1.8      2.1     2.5      2.7  2.8     3.1
Residual  0.0121  -0.0137  -0.0395  0.1347  -0.0997  p    0.0745  0.0487
c Find the value of p. (2 marks) 
d By considering the signs of the residuals, or otherwise, comment on the suitability of the 
linear regression model for these data. (1 mark) 
e Calculate the residual sum of squares (RSS). (2 marks) 


A random sample of equivalent companies in France is taken and the residual sum of squares 
is found to be 0.421. 


f State, with a reason, which sample is likely to have the better linear fit. (1 mark) 
Challenge

A set of bivariate data $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ is modelled by
the linear regression equation $y = a + bx$, where $a = \bar{y} - b\bar{x}$.

a Prove that the sum of the residuals of the data points is 0.

b By means of an example, or otherwise, explain why this condition
does not guarantee that the model closely fits the data.


Summary of key points 
The residual of a given data point is the difference between the observed value of the
dependent variable and the predicted value of the dependent variable.

The equation of the regression line of y on x is:
$$y = a + bx$$
where $b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} \qquad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} \qquad S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$$

If a set of bivariate data has regression equation $y = a + bx$, then the residual of the data point
$(x_i, y_i)$ is given by $y_i - (a + bx_i)$. The sum of the residuals of all data points is 0.

You can calculate the residual sum of squares (RSS) for a linear regression model using the
formula
$$\text{RSS} = S_{yy} - \frac{(S_{xy})^2}{S_{xx}}$$
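The key points above can be checked numerically. The following is a minimal sketch rather than a prescribed method: the function name `regression_summary` is an assumption made for illustration, and the data are the time and temperature values from the chemical reaction question earlier in this chapter. It fits the line from the summary statistics, then compares the RSS formula with the sum of squared residuals computed directly.

```python
def regression_summary(xs, ys):
    """Return a, b, the residuals and the RSS (via the summary-statistic formula).

    Hypothetical helper written for this illustration only.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n

    b = s_xy / s_xx                    # gradient: S_xy / S_xx
    a = sum_y / n - b * sum_x / n      # intercept: y-bar minus b times x-bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    rss = s_yy - s_xy ** 2 / s_xx      # RSS = S_yy - (S_xy)^2 / S_xx
    return a, b, residuals, rss


# Time t (hours) and temperature T (degrees C) from the chemical reaction question
t = [2, 3, 5, 6, 7, 9, 10]
T = [72, 68, 59, 54, 50, 42, 38]

a, b, residuals, rss = regression_summary(t, T)
rss_direct = sum(e * e for e in residuals)

print(f"T = {a:.3f} + ({b:.3f})t")
print(f"sum of residuals = {sum(residuals):.2e}")
print(f"RSS (formula) = {rss:.4f}, RSS (direct) = {rss_direct:.4f}")
```

Up to floating-point rounding, the residuals sum to zero and the two RSS values agree, which is what the key points state.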


Correlation 


After completing this chapter, you should be able to: 


● Calculate the value of the product moment correlation coefficient,
understand the effect of coding on it and understand the conditions
for its use → pages 22-26

● Calculate and interpret Spearman's rank correlation coefficient → pages 26-32

● Carry out hypothesis tests for zero correlation using either Spearman's
rank correlation coefficient or the product moment correlation coefficient
→ pages 33-38


Prior knowledge check 


1 The scatter diagram shows data on the 
age and value of a particular model of car. 


[Scatter diagram: Value (£1000s) against Age (years)]


a Describe the type of correlation shown. 


b Interpret the correlation in context. 
← SM1, Chapter 4

2 A set of 8 bivariate data points (x, y) is
summarised using $\sum x = 48$, $\sum y = 92$,
$\sum x^2 = 340$, $\sum y^2 = 1142$ and $\sum xy = 616$.

Calculate:
a $S_{xx}$   b $S_{yy}$   c $S_{xy}$   ← Chapter 1

Spearman's rank correlation coefficient can
be used to determine the degree to which
two ice skating judges agree or disagree
about the relative performance of the
skaters. It considers rankings, rather than
specific values. → Exercise 2B Q8


2.1 The product moment correlation coefficient


The product moment correlation coefficient 
(PMCC) measures the linear correlation 
between two variables. The PMCC can take values 
between 1 and —1, where 1 is perfect positive 
linear correlation and —1 is perfect negative 
linear correlation. 


In Chapter 1 you used the summary statistics $S_{xx}$, $S_{yy}$ and $S_{xy}$ to calculate the
coefficients of a regression line equation and to calculate the residual sum of squares (RSS).
You can also use these summary statistics to calculate the product moment correlation coefficient.

Watch out: The PMCC was designed to analyse continuous data that comes from a population
having a bivariate normal distribution. (That is, when considered separately, both the
x and y data sets are normally distributed.) For data sets which do not satisfy these
conditions, other correlation coefficients may be more valid. → Section 2.2


■ The product moment correlation coefficient, r, is given by
$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$$
This formula is given in the formulae booklet.

Online: Explore linear correlation between two variables, measured by the PMCC, using
GeoGebra.

You can calculate the value of the PMCC from raw data using your calculator.
← SM2, Section 1.2

Sometimes the original data is coded to make it easier to manage. Coding affects different
statistics in different ways. As long as the coding is linear, the product moment correlation
coefficient will be unaffected by the coding.

Examples of linear coding of a data set $x_i$ are $p_i = ax_i + b$ and $p_i = \dfrac{x_i - a}{b}$.

You can think of linear coding as a change in scale on the axes of a scatter graph. 


Raw data is often tightly grouped or contains very 
small or very large values. Changing the scale 
(which is equivalent to linear coding) can make 
the scatter graph easier to read. 


The degree of linear correlation is unaffected 
by the change of scale. The value of r for the 
uncoded and coded values will be the same. 
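The claim that r is the same for coded and uncoded values can be illustrated numerically. This is a minimal sketch under stated assumptions: the function name `pmcc`, the small data set and the particular codings are all invented for illustration and do not come from the text. The codings use positive scale factors, as in the change-of-scale picture above.

```python
from math import sqrt


def pmcc(xs, ys):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xy / sqrt(s_xx * s_yy)


# Illustrative raw data (not taken from the text)
x = [1.2, 2.5, 3.1, 4.8, 5.0]
y = [10, 14, 15, 21, 20]

# Hypothetical linear codings: p = (x - 1) / 0.1 and q = 2y + 3
p = [(xi - 1) / 0.1 for xi in x]
q = [2 * yi + 3 for yi in y]

r_raw = pmcc(x, y)
r_coded = pmcc(p, q)
print(f"r for raw data   = {r_raw:.6f}")
print(f"r for coded data = {r_coded:.6f}")
```

The two printed values agree up to floating-point rounding, because the shift cancels inside each of $S_{xx}$, $S_{yy}$ and $S_{xy}$, and the scale factors cancel between numerator and denominator.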


Example 1


The number of vehicles, x millions, and the number of accidents, 
y thousands, in 15 different countries were recorded. The following 
summary Statistics were calculated and a scatter graph of the data is 


given to the right: 


$\sum x = 176.9 \quad \sum y = 679 \quad \sum x^2 = 2576.47 \quad \sum y^2 = 39771 \quad \sum xy = 9915.3$
a Calculate the product moment correlation coefficient between x and y. 


b With reference to your answer to part a and the scatter graph, 
comment on the suitability of a linear regression model for these data. 


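a $S_{xx} = \sum x^2 - \dfrac{\left(\sum x\right)^2}{n} = 2576.47 - \dfrac{176.9^2}{15} = 490.23$

$S_{yy} = \sum y^2 - \dfrac{\left(\sum y\right)^2}{n} = 39771 - \dfrac{679^2}{15} = 9034.93$

$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 9915.3 - \dfrac{176.9 \times 679}{15} = 1907.63$

$r = \dfrac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \dfrac{1907.63}{\sqrt{490.23 \times 9034.93}} = 0.906$ (3 s.f.)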
b From the scatter graph, the data 
appear to be linearly distributed and the 
correlation coefficient calculated in part a 
is close to 1, so a linear regression model 
appears suitable for these data. 


[Scatter graph: y against x]


Use the standard formula given in the formulae 
booklet. 


The value of the correlation coefficient is 0.906. 
This is a positive correlation. 

The greater the number of vehicles the higher the 
number of accidents. 


Example 2

Data are collected on the amount of dietary supplement, d grams, given to a sample of 8 cows and
their milk yield, m litres. The data were coded using $x = \dfrac{d}{2} - 6$ and $y = \dfrac{m}{20}$. The following summary
statistics were obtained:
$\sum d^2 = 4592 \quad S_{dm} = 90.6 \quad \sum x = 44 \quad S_{yy} = 0.05915$

a Use the formula for $S_{yy}$ to show that $S_{mm} = 23.66$.

b Find the value of the product moment correlation coefficient between d and m.

a $S_{yy} = \sum\left(\dfrac{m}{20}\right)^2 - \dfrac{\left(\sum \frac{m}{20}\right)^2}{8}$

$= \dfrac{1}{400}\sum m^2 - \dfrac{1}{400} \times \dfrac{\left(\sum m\right)^2}{8}$

$= \dfrac{1}{400}\left(\sum m^2 - \dfrac{\left(\sum m\right)^2}{8}\right)$

$= \dfrac{1}{400} S_{mm}$

Hence $S_{mm} = 0.05915 \times 400 = 23.66$

Substitute the code for y into the formula for $S_{yy}$.

Problem-solving
If you take a factor of $\dfrac{1}{20^2} = \dfrac{1}{400}$ out of each term on the right-hand side you are left with the
formula for $S_{mm}$.

Substitute and simplify to find the value of $S_{mm}$.

b $\sum x = \sum\left(\dfrac{d}{2} - 6\right)$, so $44 = \dfrac{1}{2}\sum d - 6 \times 8$

Hence $\sum d = 184$

$S_{dd} = 4592 - \dfrac{184^2}{8} = 360$

$r = \dfrac{90.6}{\sqrt{360 \times 23.66}} = 0.982$ (3 s.f.)

Substitute the code into $\sum x$ and rearrange to find $\sum d$.

Find $S_{dd}$ using the standard formula.

Exercise


1 Given that $S_{xx} = 92$, $S_{yy} = 112$ and $S_{xy} = 100$ find the value of the product moment correlation
coefficient between x and y.


2 Given the following summary data,
$\sum x = 367 \quad \sum y = 270 \quad \sum x^2 = 33845 \quad \sum y^2 = 12976 \quad \sum xy = 17135 \quad n = 6$

calculate the product moment correlation coefficient, r, using the formula
$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$$


3 The ages, a years, and heights, h cm, of seven members of a team were recorded. The data were
summarised as follows:
$\sum a = 115 \quad \sum a^2 = 1899 \quad S_{hh} = 571.4 \quad S_{ah} = 72.1$

a Find $S_{aa}$. (1 mark)

b Find the value of the product moment correlation coefficient between a and h. (1 mark)

c Describe and interpret the correlation between the age and height of these seven people
based on these data. (2 marks)


4 In research on the quality of bacon produced by different breeds of pig, data were obtained
about the leanness, l, and taste, t, of the bacon. The data are shown in the table.

Leanness, l  1.5          2.6  3.4  5.0  6.1   8.2
Taste, t     [illegible]  5.0  7.7  9.0  10.0  10.2

a Find $S_{ll}$, $S_{tt}$ and $S_{lt}$. (3 marks)

b Calculate the product moment correlation coefficient between l and t using the values
found in part a. (2 marks)

A scatter graph is drawn of the data.

[Scatter graph: t against l]

c With reference to your answer to part b and the scatter graph, comment on the suitability
of a linear regression model for these data. (2 marks)

Eight children had their IQ measured and then took a general knowledge test. 
Their IQ, x, and their marks, y, for the test were summarised as follows: 
xaos Soe =120109 Siy=400 3277 =33000 xp = 61595, 
a Calculate the product moment correlation coefficient. (3 marks) 
b Describe and interpret the correlation coefficient between IQ and general knowledge. (2 marks) 


Two variables, x and y, were coded using A = x - 7 and B= y- 100. 
The product moment correlation coefficient between A and B is found to be 0.973. 
Find the product moment correlation coefficient between x and y. 


The following data are to be coded using the coding p = x and q = y - 100.


x 0 5 3 2 1 
y 100 117 112 110 106 


a Complete a table showing the values of p and q. 


b Use your values of p and q to find the product moment correlation coefficient between p and q.


c Hence write down the product moment correlation coefficient between x and y. 


The product moment correlation is to be worked out for the following data set using coding. 
x 50 40 55 45 60 
y 4 3 5 4 6 

a Using the coding p = ;and t = y find the values of S,,, S,,and S,,. 


b Calculate the product moment correlation coefficient between p and t.
c Write down the product moment correlation coefficient between x and y. 


A shopkeeper thinks that the more newspapers he sells in a week the more sweets he sells. He 
records the amount of money (m pounds) that he takes in newspaper sales and also the amount 
of money he takes in sweet sales (s pounds) each week for seven weeks. The data are shown in 


the following table. 
Newspaper sales, m pounds 380 402 370 365 410 392 385 
Sweet sales, s pounds 560 543 564 573 550 544 530 


a Use the coding x = m - 365 and y = s - 530 to find $S_{xx}$, $S_{yy}$ and $S_{xy}$. (4 marks)
b Calculate the product moment correlation coefficient for m and s. (1 mark)
c State, with a reason, whether or not what the shopkeeper thinks is correct. (1 mark)
10 A student vet collected 8 blood samples from a horse with an infection. For
each sample, she recorded the amount of drug, f, given to the horse and the
amount of antibodies present in the blood, g. She coded the data using
f = 10x and g = 5(y + 10) and drew a scatter diagram of x against y.

[Scatter diagram with unlabelled axes, shown alongside the summary data:
$\sum g^2 = 74458.75 \quad S_{fg} = 5667.5 \quad S_{xx} = 111.48 \quad \sum y = 70.9$]

Unfortunately, she forgot to label the axes on her scatter diagram and left the summary data
calculations incomplete.

A second student was asked to complete the analysis of the data.
a Show that $S_{ff} = 11148$. (3 marks)

Problem-solving
Substitute the code into the formula for $S_{ff}$.

b Find the value of the product moment correlation coefficient between f and g. (4 marks)
c With reference to the scatter diagram, comment on the result in part b. (1 mark)


11 Alice, a market gardener, measures the amount of fertiliser, x litres, that she adds to the 
compost for a random sample of 7 chilli plant beds. She also measures the yield of chillies, ykg. 
The data are shown in the table below: 


x, litres  1.1  1.3   1.4  1.7  1.9  2.1  2.5
y, kg      6.2  10.5  12   15   17   18   19
($\sum x = 12 \quad \sum x^2 = 22.02 \quad \sum y = 97.7 \quad \sum y^2 = 1491.69 \quad \sum xy = 180.37$)


a Show that the product moment correlation coefficient for these data is 0.946, correct to three 
significant figures. (4 marks) 


The equation of the regression line of y on x is given as $y = -1.2905 + 8.8945x$.


b Calculate the residuals. (3 marks) 
Alice thinks that because the PMCC is close to 1, a linear relationship is a good model for 

these data. 

c With reference to the residuals, evaluate Alice’s conclusion. (2 marks) 


2.2 Spearman's rank correlation coefficient


In cases where the correlation is not linear, or where the data are not measurable on a continuous 
scale, the PMCC may not be a good measure of the correlation between two variables. 


For example, suppose a manufacturer of tea produced a number of different blends; you could taste 
each blend and place the blends in order of preference. You do not, however, have a continuous 
numerical scale for measuring your preference. Similarly, it may be quicker to arrange a group of 
individuals in order of height than to measure each one. Under these circumstances, Spearman’s rank 
correlation coefficient is used. 
Spearman's rank correlation coefficient is denoted by $r_s$. In principle, $r_s$ is simply a special
case of the product moment correlation coefficient in which the data are converted to
rankings before calculating the coefficient.

Note: Spearman's rank correlation coefficient is sometimes used as an approximation for the
product moment correlation coefficient as it is easier to calculate.

■ To rank two sets of data, X and Y, you give the rank 1 to the highest of the $x_i$ values, 2
to the next highest, 3 to the next highest, and so on. You do the same for the $y_i$ values.

Note: It makes no difference if you rank the smallest as 1, the next smallest as 2, etc.,
provided you do the same to both X and Y.

In general, you might choose to use Spearman's rank correlation coefficient rather than the product
moment correlation coefficient if one of the following conditions is true:

• one or both data sets are not from a normally distributed population

• there is a non-linear relationship between the two data sets

• one or both data sets already represent a ranking (as in Example 3 below).


Example 3


Two tea tasters were asked to rank nine blends of tea in their order of preference. The tea they 
liked best was ranked 1. Their orders of preference are shown in the table: 


Blend A B C D E F G H I
Taster 1 (x) 3 6 2 8 5 9 7 1 4 
Taster 2 (y) 5 6 4 2 7 8 9 1 3 


Calculate Spearman’s rank correlation coefficient for these data. 


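Online: Explore Spearman's rank correlation coefficient using GeoGebra.

x_i    y_i    x_i²   y_i²   x_i y_i
3      5      9      25     15
6      6      36     36     36
2      4      4      16     8
8      2      64     4      16
5      7      25     49     35
9      8      81     64     72
7      9      49     81     63
1      1      1      1      1
4      3      16     9      12

$\sum x_i = 45 \quad \sum y_i = 45 \quad \sum x_i^2 = 285 \quad \sum y_i^2 = 285 \quad \sum x_i y_i = 258$

$$r_s = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}{\sqrt{\left(\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}\right)\left(\sum y_i^2 - \dfrac{\left(\sum y_i\right)^2}{n}\right)}} = \frac{258 - \dfrac{45 \times 45}{9}}{\sqrt{\left(285 - \dfrac{45^2}{9}\right)\left(285 - \dfrac{45^2}{9}\right)}} = \frac{33}{\sqrt{60 \times 60}} = 0.55$$

Find $x_i^2$, $y_i^2$ and $x_i y_i$.

Find $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum y_i^2$ and $\sum x_i y_i$.

Use the standard formula to calculate $r_s$.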
Since Spearman's rank correlation coefficient is derived from the PMCC:

■ A Spearman's rank correlation coefficient of:
• +1 means that the rankings are in perfect agreement
• -1 means that the rankings are in exact reverse order
• 0 means that there is no correlation between the rankings.


You can calculate Spearman's rank correlation coefficient more quickly by looking at the differences 
between the ranks of each observation. 


■ If there are no tied ranks, Spearman's rank correlation coefficient, $r_s$, is calculated using
$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where d is the difference between the ranks of each observation, and n is the number
of pairs of observations.

Watch out: Tied ranks occur when two or more data values in one of the data sets
are the same. If there are only one or two tied ranks, this formula gives a reasonable
estimate for $r_s$, but if there are many tied ranks then you should use the PMCC formula
with the ranked data.
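As a check that the difference formula and the PMCC formula agree when there are no tied ranks, the sketch below applies both to the two tea tasters' rankings from the example above. It is only an illustration: the function names `pmcc` and `spearman_from_differences` are assumptions introduced here, not names used in the text.

```python
from math import sqrt


def pmcc(xs, ys):
    """PMCC computed from the summary statistics S_xx, S_yy and S_xy."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xy / sqrt(s_xx * s_yy)


def spearman_from_differences(rank_x, rank_y):
    """r_s = 1 - 6 * sum(d^2) / (n(n^2 - 1)); valid when there are no tied ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Rankings of blends A to I by the two tasters
taster_1 = [3, 6, 2, 8, 5, 9, 7, 1, 4]
taster_2 = [5, 6, 4, 2, 7, 8, 9, 1, 3]

r_pmcc = pmcc(taster_1, taster_2)
r_diff = spearman_from_differences(taster_1, taster_2)
print(f"PMCC on ranks      = {r_pmcc:.2f}")
print(f"Difference formula = {r_diff:.2f}")
```

Both routes give 0.55 for these rankings; here $\sum d^2 = 54$ and $n = 9$.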


Example 4

During a cattle show, two judges ranked ten cattle for quality according to the following table.

Cattle   A  B  C  D  E  F  G  H  I   J
Judge A  1  5  2  6  4  8  3  7  10  9
Judge B  3  6  2  7  5  8  1  4  9   10

Find Spearman's rank correlation coefficient between the two judges and comment on the result.

A      B      d      d²
1      3      -2     4
5      6      -1     1
2      2      0      0
6      7      -1     1
4      5      -1     1
8      8      0      0
3      1      2      4
7      4      3      9
10     9      1      1
9      10     -1     1
Total                22

$$r_s = 1 - \frac{6 \times 22}{10(10^2 - 1)} = 0.867 \text{ (3 s.f.)}$$

There is a reasonable degree of agreement between the two judges.

In this case the data are already ranked, and there are no tied ranks.

Find d and $d^2$ for each pair of ranks.

Find $\sum d^2$.

Calculate $r_s$.

Draw a conclusion.
If you are ranking data, and two or more data values are equal, then these data values will have a
tied rank.

■ Equal data values should be assigned a rank equal to the mean of the tied ranks.

For example:
Data value  200  350  350  400  700  800  800  800  1200
Rank        1    2.5  2.5  4    5    7    7    7    9

The 2nd and 3rd ranks are tied, so assign a rank of 2.5 to each of these data values.

The 6th, 7th and 8th ranks are tied, so assign a rank of $\dfrac{6 + 7 + 8}{3} = 7$ to each of these data values.

When ranks are tied, the formula for the Spearman’s rank correlation coefficient only gives an 
approximate value of r,. This approximation is sufficient when there are only a small number of tied 
ranks. If there are many tied ranks, then you should use the PMCC formula with the ranked data. 
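The mean-of-tied-ranks rule can be sketched in code as follows. This is one possible implementation, not the only one: the function name `mean_ranks` is an assumption, and it ranks from smallest (rank 1) to largest, matching the table of data values above.

```python
def mean_ranks(values):
    """Rank values from smallest (1) to largest, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' to cover every value equal to the one at 'pos'
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions pos..end correspond to ranks pos+1..end+1; use their mean
        shared_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


data = [200, 350, 350, 400, 700, 800, 800, 800, 1200]
ranks = mean_ranks(data)
print(ranks)  # [1.0, 2.5, 2.5, 4.0, 5.0, 7.0, 7.0, 7.0, 9.0]
```

When there are many ties, the ranks produced this way would be passed to the PMCC formula rather than to the $\sum d^2$ formula, as described above.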


Example 5

The marks of eight pupils in French and German tests were as follows:

            A   B   C   D   E   F   G   H
French, f%  52  25  86  33  55  55  54  46
German, g%  40  48  65  57  40  39  63  34

a Use the formula $r_s = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$ to find an estimate for Spearman's rank correlation
coefficient, showing clearly how you deal with tied ranks. Give your answer to 2 decimal places.

b Without recalculating the correlation coefficient, state how your answer to part a would change if:

i pupil H's mark for German was changed to 38%
ii a ninth pupil was included who scored 95% in French and 89% in German.

The teacher collects extra data from other students in the class and finds that there are now many
tied ranks.

c Describe how she would now find a measure of the correlation.

a
f     g     Rank, f   Rank, g   d      d²
52    40    4         3.5       0.5    0.25
25    48    1         5         -4     16
86    65    8         8         0      0
33    57    2         6         -4     16
55    40    6.5       3.5       3      9
55    39    6.5       2         4.5    20.25
54    63    5         7         -2     4
46    34    3         1         2      4

$\sum d^2 = 69.5$

$$r_s = 1 - \frac{6 \times 69.5}{8(8^2 - 1)} = 0.17 \text{ (2 d.p.)}$$

Both sets of data should be ranked in the same order, in this case smallest to largest.

Since the two 40s are tied in rank, both data points are allocated the mean of ranks 3 and 4.

Since the two 55s are tied in rank, both data points are allocated the mean of ranks 6 and 7.

Calculate $d^2$ and use the formula for $r_s$.
b i Pupil rank does not change so no effect on $r_s$.

ii Pupil will have the same rank for both tests (9) and hence d = 0.
This means $\sum d^2$ will not change but since n has increased by 1, the denominator is bigger
so $r_s$ will increase.

c Give the mean of the tied ranks to each pupil and calculate the PMCC directly from the
ranked data, rather than using the Spearman's rank correlation coefficient formula.


1 The scatter graph shows the length, l m, and mass, m kg, of
10 randomly selected male Siberian tigers.

[Scatter graph: mass, m, against length, l]

A student wishes to analyse the correlation between l and m.
Give one reason why the student might choose to use:
a the product moment correlation coefficient
b Spearman's rank correlation coefficient.

2 A college is trying to determine whether a published placement test (PPT) gives a good indicator
of the likely student performance in a final exam. Data on past performances are shown in the
scatter graph:

[Scatter graph: Final exam percentage against PPT score]

Give a reason why the college should not use the product moment correlation coefficient to
measure the strength of the correlation of the two variables.

3 A sports science researcher is investigating whether there is a correlation between the height of a
basketball player and the number of attempts it takes them to score a free throw. The researcher
proposes to collect a random sample of data and then calculate the product moment correlation
coefficient between the two variables.

Give a reason why the PMCC would not be appropriate in this situation and state an alternative
method that the researcher can use.
4 For each of the data sets of ranks given below, calculate the Spearman's rank correlation
coefficient and interpret the result.

a  first row:   1  2  3  [remaining ranks illegible]
   second row:  3  2  1  [remaining ranks illegible]
b  first row:   1  2  3  [...]  10
   second row:  2  4  [...]  10
c  first row:   5  2  6  [remaining ranks illegible]
   second row:  5  6  3  [remaining ranks illegible]

5 Match the scatter graphs with the given values of Spearman's rank correlation coefficient.

[Scatter graphs of y against x]

$r_s = -1 \quad r_s = 0.9 \quad r_s = 1 \quad r_s = 0.5$


6 The number of goals scored by football teams and their positions in the league were recorded
as follows for the top 12 teams.

Team             A   B   C   D   E   F   G   H   I   J   K            L
Goals            49  44  43  36  40  39  29  21  28  30  [illegible]  26
League position  1   2   3   4   5   6   7   8   9   10  11           12

a Find $\sum d^2$, where d is the difference between the ranks of each observation.

b Calculate Spearman's rank correlation coefficient for these data.
What conclusions can be drawn from this result?

7 A veterinary surgeon and a trainee veterinary surgeon both rank a small herd of cows for
quality. Their rankings are shown below. 


Cow A D F E B C val 
Qualified vet 1 2 3 4 5 6 7 8 
Trainee vet 1 2 5 6 4 3 8 7 


Find Spearman’s rank correlation coefficient for these data, and comment on the experience of 
the trainee vet. 


8 Two adjudicators at an ice dance skating competition award marks as follows.

Competitor  A    B    C    D    E    F    G    H    I    J
Judge 1     7.8  6.6  7.3  7.4  8.4  6.5  8.9  8.5  6.7  7.7
Judge 2     8.1  6.8  8.2  7.5  8.0  6.7  8.5  8.3  6.6  7.8

a Explain why you would use Spearman's rank correlation coefficient in this case.
b Calculate Spearman's rank correlation coefficient $r_s$, and comment on how well the judges agree.

It turns out that Judge 1 incorrectly recorded their score for competitor A and it should have been
[value illegible].


c Explain how you would now deal with equal data values if you had to recalculate the 
Spearman’s rank correlation coefficient. 


9 In a diving competition, two judges scored each of 7 divers on a forward somersault with twist.

Diver          A    B    C    D    E    F    G
Judge 1 score  4.5  5.1  5.2  5.2  5.4  5.7  5.8
Judge 2 score  5.2  4.8  4.9  5.1  5.0  5.3  5.4


a Give one reason to support the use of Spearman’s rank correlation coefficient in 
this case. (1 mark) 


b Calculate Spearman’s rank correlation coefficient for these data. (4 marks) 


The judges also scored the divers on a back somersault with two twists. 
Spearman’s rank correlation coefficient for their ranks in this case was 0.676. 


c Compare the judges’ ranks for the two dives. (1 mark) 


10 Two tea tasters sample 6 different teas and give each a score out of 10. 


Tea A B C D E F 
Taster 1 score 7 8 6 7 9 10 
Taster 2 score 8 9 5 7 10 10 


a Calculate Spearman’s rank correlation coefficient for these data. (4 marks) 
b Without recalculating the correlation coefficient, explain how your answer to part a would 
change if: 
i taster 2 changed his score for tea C to 6 
ii a seventh tea was tasted and both tasters scored it as 4. (3 marks) 


The tasters tried lots of other types of tea and found that there were now many tied ranks. 
c Describe how you would now find the correlation. (1 mark) 
2.3 Hypothesis testing for zero correlation


You might need to carry out a hypothesis test to determine whether the correlation for a particular 
sample indicates that there is likely to be a non-zero correlation within the whole population. 
Hypothesis tests for zero correlation can use either the product moment correlation coefficient or the 


Spearman's rank correlation coefficient. 


To test whether the population correlation, p, 


is greater than zero or to test whether the 
population correlation, p, is less than zero, 


use a one-tailed test: 


= For a one-tailed test, use either: 


¢ Hp: p=0,H,:p>0o0r 
¢ Hp: p=0,H,:p<0 


To test whether the population correlation is 
not equal to zero, use a two-tailed test: 


= Fora two-tailed test, use: 
¢ Hp: p=0,H,: p70 


Determine the critical region for r or r, 
using the tables of critical values given in the 
formulae booklet and on page 216, or using your 
calculator. This is the table of critical values for 
the product moment correlation coefficient. 


Notation | r is usually used to denote the 


correlation for a sample. The Greek letter rho (p) 
is used to denote the correlation for the whole 
population. If you need to distinguish between 
the PMCC and the Spearman's rank correlation 
coefficient, use r and p for the PMCC and user, 
and p, for Spearman's. 


You carried out hypothesis tests for zero 
correlation with the product moment correlation 
coefficient in your A level course. 


€ SM2, Section 1.3 


t Watch out The table for the Spearman’s rank 


correlation coefficient is different. Read the 
question carefully and choose the correct table. 


Example 


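Product moment coefficient

                  Level                     | Sample
0.10   | 0.05   | 0.025  | 0.01   | 0.005  | size
0.8000 | 0.9000 | 0.9500 | 0.9800 | 0.9900 | 4
0.6870 | 0.8054 | 0.8783 | 0.9343 | 0.9587 | 5
0.6084 | 0.7293 | 0.8114 | 0.8822 | 0.9172 | 6
0.5509 | 0.6694 | 0.7545 | 0.8329 | 0.8745 | 7
0.5067 | 0.6215 | 0.7067 | 0.7887 | 0.8343 | 8
0.4716 | 0.5822 | 0.6664 | 0.7498 | 0.7977 | 9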


For a sample size of 8 you see from the table that the critical value of $r$ to be
significant at the 5% level on a one-tailed test is 0.6215. An observed value
of $r$ greater than 0.6215 from a sample of size 8 would provide sufficient
evidence to reject the null hypothesis and conclude that $\rho > 0$. Similarly, an
observed value of $r$ less than $-0.6215$ would provide sufficient evidence to
conclude that $\rho < 0$.
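The comparison with the critical value can be sketched in a few lines of Python. This is only an illustration: the function name `in_critical_region` and its `alternative` labels are assumptions made for this sketch, and the critical value still has to be read from the correct table by hand.

```python
def in_critical_region(r, critical_value, alternative):
    """Return True if the observed r lies in the critical region.

    alternative is a label chosen for this sketch:
      "greater"   for H1: rho > 0
      "less"      for H1: rho < 0
      "two-sided" for H1: rho != 0 (read the critical value from the
                  column for half the significance level)
    """
    if alternative == "greater":
        return r > critical_value
    if alternative == "less":
        return r < -critical_value
    if alternative == "two-sided":
        return abs(r) > critical_value
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")


# Sample size 8, 5% one-tailed test: critical value 0.6215
print(in_critical_region(0.70, 0.6215, "greater"))  # True
print(in_critical_region(-0.50, 0.6215, "less"))    # False
```

The observed values 0.70 and -0.50 are made up for the demonstration.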


A chemist observed 20 reactions, and recorded the mass of the reactant, $x$ grams, and the duration
of a reaction, $y$ minutes.

She summarised her findings as follows:

$\sum x = 20 \quad \sum y = 35 \quad \sum xy = 65 \quad \sum x^2 = 35 \quad \sum y^2 = 130$

Test, at the 5% significance level, whether these results show evidence of any correlation between
the mass of the reactant and the duration of the reaction.
$H_0: \rho = 0$, $H_1: \rho \neq 0$
Sample size = 20

Significance level in each tail = 0.025

From the table, critical values of $r$ for a 0.025 significance level with a sample size of 20 are
$r = \pm 0.4438$, so the critical region is $r < -0.4438$ or $r > 0.4438$.

$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \dfrac{65 - \frac{20 \times 35}{20}}{\sqrt{\left(35 - \frac{20^2}{20}\right)\left(130 - \frac{35^2}{20}\right)}} = 0.934...$

$0.934... > 0.4438$. The observed value of $r$ lies within the critical region, so reject $H_0$.

There is evidence, at the 5% level of significance, that there is a correlation
between the mass of the reactant and the duration of the reaction.
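As a check on the arithmetic in this example, the calculation from summary statistics can be sketched in Python. The helper name `pmcc_from_sums` is an assumption for this sketch; the critical value 0.4438 is the table value for a sample of 20 with 0.025 in each tail.

```python
from math import sqrt


def pmcc_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Product moment correlation coefficient from summary statistics."""
    s_xx = sum_xx - sum_x ** 2 / n
    s_yy = sum_yy - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)


# Chemist example: n = 20 reactions
r = pmcc_from_sums(n=20, sum_x=20, sum_y=35, sum_xx=35, sum_yy=130, sum_xy=65)

# Two-tailed test at the 5% level: compare |r| with the 0.025 column
critical_value = 0.4438
reject_h0 = abs(r) > critical_value

print(round(r, 3), reject_h0)  # 0.934 True
```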


You can carry out a hypothesis test for zero correlation using Spearman’s rank correlation coefficient 
in the same way. The table of critical values for Spearman’s coefficient is also given in the formulae 


booklet and on page 216. 


Spearman's coefficient

Sample |          Level
size   | 0.05   | 0.025  | 0.01
4      | 1.0000 | -      | -
5      | 0.9000 | 1.0000 | 1.0000
6      | 0.8286 | 0.8857 | 0.9429
7      | 0.7143 | 0.7857 | 0.8929
8      | 0.6429 | 0.7381 | 0.8333
9      | 0.6000 | 0.7000 | 0.7833
For a sample size of 8 you see from the table that the critical value of $r_s$ to
be significant at the 0.025 level on a one-tailed test is 0.7381.




Example 


The popularity of 16 subjects at a comprehensive school was found by counting the number of 
boys and the number of girls who chose each subject and then ranking the subjects. The results are 
shown in the table below. 


Subject         | A | B | C  | D | E | F | G  | H  | I | J  | K  | L  | M | N | O  | P
Boys’ ranks, b  | 2 | 5 | 9  | 8 | 1 | 3 | 15 | 16 | 6 | 10 | 12 | 14 | 4 | 7 | 11 | 13
Girls’ ranks, g | 4 | 7 | 11 | 3 | 6 | 9 | 12 | 16 | 5 | 13 | 10 | 8  | 2 | 1 | 15 | 14


a Calculate Spearman’s rank correlation coefficient. 


b Using a suitable test, at the 1% level of significance, test the assertion that boys’ and girls’ choices 
are positively correlated. 


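a Extend the table to include rows for $d$ and $d^2$ (here taking $d = b - g$).

Subject         | A  | B  | C  | D  | E  | F  | G  | H  | I | J  | K  | L  | M | N  | O  | P
Boys’ ranks, b  | 2  | 5  | 9  | 8  | 1  | 3  | 15 | 16 | 6 | 10 | 12 | 14 | 4 | 7  | 11 | 13
Girls’ ranks, g | 4  | 7  | 11 | 3  | 6  | 9  | 12 | 16 | 5 | 13 | 10 | 8  | 2 | 1  | 15 | 14
d               | -2 | -2 | -2 | 5  | -5 | -6 | 3  | 0  | 1 | -3 | 2  | 6  | 2 | 6  | -4 | -1
d²              | 4  | 4  | 4  | 25 | 25 | 36 | 9  | 0  | 1 | 9  | 4  | 36 | 4 | 36 | 16 | 1

$\sum d^2 = 214$

$r_s = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 214}{16(16^2 - 1)} = 0.685...$

b State your hypotheses. You are testing for positive correlation so this is a one-tailed test.
$H_0: \rho = 0$, $H_1: \rho > 0$

Find the critical value. From the tables, for a sample size of 16 the critical value is 0.5824.

See if your value of $r_s$ is significant. Since $0.685... > 0.5824$, the result is significant at the 1% level.

Draw a conclusion. You reject $H_0$ and accept $H_1$: there is evidence that boys’ and girls’
choices are positively correlated.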
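The calculation of $\sum d^2$ and $r_s$ in this example can be checked with a short Python sketch. The function name `spearman_from_ranks` is an assumption for this illustration, and the formula is only valid when there are no tied ranks.

```python
def spearman_from_ranks(ranks_a, ranks_b):
    """Return (sum of d^2, r_s) for two lists of ranks with no ties."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
    return sum_d2, r_s


# Subjects A to P from the example
boys = [2, 5, 9, 8, 1, 3, 15, 16, 6, 10, 12, 14, 4, 7, 11, 13]
girls = [4, 7, 11, 3, 6, 9, 12, 16, 5, 13, 10, 8, 2, 1, 15, 14]

sum_d2, r_s = spearman_from_ranks(boys, girls)
print(sum_d2, round(r_s, 3))  # 214 0.685

# One-tailed test at the 1% level, sample size 16: critical value 0.5824
print(r_s > 0.5824)  # True
```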


Exercise 




1 A sample of 7 observations (x, y) was taken, and the following values were calculated: 
$\sum x = 29 \quad \sum x^2 = 131 \quad \sum y = 28 \quad \sum y^2 = 140 \quad \sum xy = 99$
a Calculate the product moment correlation coefficient for this sample. 


b Test $H_0: \rho = 0$ against $H_1: \rho \neq 0$. Use a 1% significance level and state any assumptions you
have made.
2 The ages, X years, and heights, Y cm, of 11 members of an athletics club were recorded and the
following statistics were used to summarise the results. 
yxyetes SS ¥=H1275 }ex¥=20704 J X2= 2585 JY? =320019 
a Calculate the product moment correlation coefficient for these data. (3 marks) 


b Test the assertion that the ages and heights of the club members are positively correlated. 
State your conclusion in words and any assumptions you have made. 
Use a 5% level of significance. (5 marks) 


3 A sample of 30 compact cars was taken, and the fuel consumption and engine sizes of the cars 
were ranked. 


A consumer group wants to test whether fuel consumption and engine size are related. 


a Find the critical region for a hypothesis test based on Spearman’s rank correlation coefficient. 
Use a 5% level of significance. 


A Spearman’s rank correlation coefficient of $r_s = 0.5321$ was calculated for the sample.
b Comment on this value in light of your answer to part a. 


4 For one of the activities at a gymnastics competition, 8 gymnasts were awarded marks out of 10 
for artistic performance and for technical ability. The results were as follows. 


Gymnast              | A   | B   | C   | D   | E   | F   | G   | H
Technical ability    | 8.5 | 8.6 | 9.5 | 7.5 | 6.8 | 9.1 | 9.4 | 9.2
Artistic performance | 6.2 | 7.5 | 8.2 | 6.7 | 6.0 | 7.2 | 8.0 | 9.1


The value of the product moment correlation coefficient for these data is 0.774. 


a Stating your hypotheses clearly, and using a 1% level of significance, test for evidence of a 

positive association between technical ability and artistic performance. 

Interpret this value. (4 marks) 
b Calculate the value of Spearman’s rank correlation coefficient for these data. (3 marks) 
c Give one reason why a hypothesis test based on Spearman’s rank correlation coefficient 

might be more suitable for this data set. (1 mark) 
d Use your answer to part b to carry out a second hypothesis test for evidence of a positive 

correlation between technical ability and artistic performance. 

Use a 1% significance level. (4 marks) 


5 Two judges ranked 8 ice skaters in a competition according to the table below. 


Sea eh ge) Ge ts Ge | a || al aa 
Judge 
A >| se |e 7) es |r] 4] « 
B 3.|a%) 6|5 |7 | a | a |] 8 


A test is to be carried out to see if there is a positive association between the rankings of the judges. 


a Give a reason to support the use of Spearman’s rank correlation coefficient in this 
case. (1 mark) 


b Evaluate Spearman’s rank correlation coefficient. (3 marks) 
c Carry out the test at the 5% level of significance, stating your hypotheses clearly. (4 marks) 
Each of the teams in a school hockey league had the total number of goals scored by them and
against them recorded, with the following results. 


Team A B Cc D E F G 
Goals for 39 40 28 27 26 30 42 
Goals against 22 28 27 42 24 38 23 


Investigate whether there is any association between the goals for and those against by using 
Spearman’s rank correlation coefficient. Use a suitable test at the 1% level to investigate the 
statement, ‘A team that scores a lot of goals concedes very few goals’. 


The weekly takings and weekly profits for six different branches of a kebab restaurant are 
shown in the table below. 


Shop 1 2 3 4 5 6 
Takings (£) | 400 | 6200 | 3600 | 5100 | 5000 | 3800 
Profits (£) 400 | 1100 | 450 | 750 | 800 | 500 


a Calculate Spearman’s rank correlation coefficient, $r_s$, between the takings and
profit. (3 marks)


b Test, at the 5% significance level, the assertion that profits and takings are positively 
correlated. (4 marks) 


The rankings of 12 students in mathematics and music were as follows. 


Mathematics 1 2 3 4 5 6 7 8 9 10 11 12 
Music 6 4 2 3 1 7 5 9 10 8 11 12 


a Calculate Spearman’s rank correlation coefficient, $r_s$, showing your value of $\sum d^2$. (3 marks)


b Test the assertion that there is no correlation between these subjects. State the null 
and alternative hypotheses used. Use a 5% significance level. (4 marks) 


A child is asked to place 10 objects in order and gives the ordering 
A C H F BDGETJI 

The correct ordering is 
A B C D E F G H I J


Conduct, at the 5% level of significance, a suitable hypothesis test to determine whether there is 
a positive association between the child’s order and the correct ordering. You must state clearly 
which correlation coefficient you are using and justify your selection. (8 marks) 


The crop of a root vegetable was measured over six consecutive years, the years being ranked 
for wetness. The results are given in the table below. 


Year 1 2 3 4 5 6 
Crop (10000 tons) 62 73 52 77 63 61 
Rank of wetness 5 4 1 6 3 2 


A seed producer claims that crop yield and wetness are not correlated. Test this assertion using 
a 5% significance level. You must state which correlation coefficient you are using and justify 
your selection. (8 marks) 
11 A researcher collects data on the heights and masses of a random sample of gorillas. She finds
that the correlation coefficient between the data is 0.546. 


a Explain which measure of correlation the researcher is likely to have used. 


Given that the value of the correlation coefficient provided sufficient evidence to accept the 
alternative hypothesis that there is positive correlation between the variables, 


b find the smallest possible significance level given that she collected data from 14 gorillas 


c find the smallest possible sample size given that she carried out the test at the 5% level of 
significance. 


Mixed exercise 2


@) 1 Wai wants to know whether the 10 people in her group are as good at science as they are at art. 
She collected the end of term test marks for science (s), and art (a), and coded them 


using x = 7p and y= 79 
The data she collected can be summarised as follows, 
yxae? Sox =465 Dlye6s Fo =29 Joop] 454 
a Work out the product moment correlation coefficient for x and y. (3 marks) 
b Write down the product moment correlation coefficient for s and a. (1 mark) 


c Write down whether or not it is it true to say that the people in Wai’s group who are good at 
science are also good at art. Give a reason for your answer. (1 mark) 


@) 2 Nimer thinks that oranges that are very juicy cost more than those that are not very juicy. He 
buys 20 oranges from different places, and measures the amount of juice (j ml), that each orange 
produces. He also notes the price (p) of each orange. 


The data can be summarised as follows, 

> 37 = 979 ype 15 yy 52335 Yop? = 32156 Yo jp = 39950. 
a Find $S_{jj}$, $S_{pp}$ and $S_{jp}$. (3 marks)
b Using your answers to part a, calculate the product moment correlation coefficient. (1 mark) 


c Describe the type of correlation between the amount of juice and the cost and state, with a 


reason, whether or not Nimer is correct. (2 marks) 
3 A geography student collected data on GDP q 
per capita, x (in $1000s), and infant mortality ee = 77,0875 
rates, y (deaths per 1000), from a sample of Syq= — 11.625 
8 countries. She coded the data using : 
y diy = 491 

p=x- 10and gq == and drew a scatter 

. Res Su, = 85.5 
diagram of p against q. 40 > - 


10 Pp 


Unfortunately, she spilt coffee on her work and the only things still legible were her unfinished 
scatter diagram and a few summary data calculations. 


a Show that S,, = S,.. (3 marks) 
b Find the value of the product moment correlation coefficient between p and q. (4 marks) 
c Write down the value of the correlation coefficient between x and y. (1 mark)


d With reference to the scatter diagram, comment on the result in part b. (1 mark) 


Two judges at a cat show place the 10 entries in the following rank orders. 


Cat          | A | B | C | D | E | F | G  | H | I | J
First judge  | 4 | 6 | 1 | 2 | 5 | 3 | 10 | 9 | 8 | 7
Second judge | 2 | 9 | 3 | 1 | 7 | 4 | 6  | 8 | 5 | 10
a Explain why Spearman’s rank correlation coefficient is appropriate for these data. (1 mark) 
b Find the value of Spearman’s rank correlation coefficient. (3 marks) 
c Explain briefly the role of the null and alternative hypotheses in a test of significance. (1 mark) 
d Stating your hypotheses clearly, carry out a test at the 5% level of significance and use your 


result to comment on the extent of the agreement between the two judges. (4 marks) 
a Explain briefly the conditions under which you would measure association using Spearman’s 
rank correlation coefficient. (1 mark) 
b Nine applicants for places at a college were interviewed by two tutors. Each tutor ranked the 
applicants in order of merit. The rankings are shown below. 


Applicant | A | B | C | D | E | F | G | H | I
Tutor 1   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Tutor 2   | 1 | 3 | 5 | 4 | 2 | 7 | 9 | 8 | 6
By carrying out a suitable hypothesis test, investigate the extent of the agreement between the 
two tutors. (7 marks) 


In a ski jumping contest each competitor made two jumps. The order of merit for the 
10 competitors who completed both jumps are shown. 


Ski jumper  | A | B  | C | D | E  | F | G | H | I | J
First jump  | 2 | 9  | 7 | 4 | 10 | 8 | 6 | 5 | 1 | 3
Second jump | 4 | 10 | 5 | 1 | 8  | 9 | 2 | 7 | 3 | 6


a Calculate, to 2 decimal places, Spearman’s rank correlation coefficient for the performance of 
the ski jumpers in the two jumps. (3 marks) 


b Using a 5% significance, and quoting from the table of critical values, investigate whether 
there is a positive association between performance on the first and second jumps. State your 
null and alternative hypotheses clearly. (4 marks) 


An expert on porcelain is asked to place seven china bowls in date order of manufacture,
assigning the rank 1 to the oldest bowl. The actual dates of manufacture and the order given by
the expert are shown below.


Bowl                  | A    | B    | C    | D    | E    | F    | G
Date of manufacture   | 1920 | 1857 | 1710 | 1896 | 1810 | 1690 | 1780
Order given by expert | 7    | 3    | 4    | 6    | 2    | 1    | 5


Carry out a hypothesis test to determine whether the expert is able to judge relative age 
accurately. You must state: 

• the significance level of your test
• your null and alternative hypotheses
• which correlation coefficient, with justification. (8 marks)
A small bus company provides a service for a small town and some neighbouring villages.

In a study of their service a random sample of 20 journeys was taken and the distances $x$, in
kilometres, and journey times $t$, in minutes, were recorded. The average distance was 4.535 km
and the average journey time was 15.15 minutes.

a Using $\sum x^2 = 493.77$, $\sum t^2 = 4897$, $\sum xt = 1433.8$, calculate the product moment correlation
coefficient for these data. (3 marks)
b Stating your hypotheses clearly test, at the 5% level, whether or not there is evidence of a 

positive correlation between journey time and distance. (4 marks) 
c State any assumptions that have to be made to justify the test in part b. (1 mark) 


A group of students scored the following marks in their statistics and geography exams. 


Student    | A  | B  | C  | D  | E  | F  | G  | H
Statistics | 64 | 71 | 49 | 38 | 72 | 55 | 54 | 68
Geography  | 55 | 50 | 51 | 47 | 65 | 45 | 39 | 82


a Find the value of Spearman’s rank correlation coefficient between the marks of these 
students. (3 marks) 


b Stating your hypotheses and using a 5% level of significance, test whether marks in 
statistics and marks in geography are associated. (4 marks) 


An international study of female literacy investigated whether there was any correlation 
between the life expectancy of females and the percentage of adult females who were literate. 
A random sample of 8 countries was taken and the following data were collected. 


Life expectancy (years) 49 | 76 | 69 | 71 | 50 | 64 | 78 | 74 
Literacy (%) 25 | 88 | 80 | 62 | 37 | 86 | 89 | 67 


a Find Spearman’s rank correlation coefficient for these data. (3 marks) 


b Stating your hypotheses clearly test, at the 5% level of significance, whether or not there 
is evidence of a correlation between the rankings of literacy and life expectancy for 
females. (4 marks) 


c Give one reason why Spearman’s rank correlation coefficient and not the product moment 
correlation coefficient has been used in this case. (1 mark) 


d Without recalculating the correlation coefficient, explain how your answer to part a would 
change if: 


i the literacy percentage for the eighth country was actually 77 


ii a ninth country was added to the sample with life expectancy 79 years and literacy 
percentage 92%. (3 marks) 


A much larger sample is taken and it is found that there are many tied ranks. 


e Describe how you would find the correlation with many tied ranks. (2 marks) 
Six Friesian cows were ranked in order of merit at an agricultural show by the official judge and
by a student vet. The ranks were as follows: 


Official judge me ai ee wie ae. 
Student vet 1 5 4 2 6 3 
a Calculate Spearman’s rank correlation coefficient between these rankings. (3 marks) 


b Investigate whether or not there was agreement between the rankings of the judge and the 
student. 


State clearly your hypotheses, and carry out an appropriate one-tailed significance 
test at the 5% level. (4 marks) 


As part of a survey in a particular profession, age, x years, and salary, £y thousands, were recorded. 
The values of x and y for a randomly selected sample of ten members of the profession are as 
follows: 


x | 30 | 52 | 38 | 48 | 56 | 44 | 41 | 25 | 32 | 27 
y | 22 | 38 | 40 | 34 | 35 | 32 | 28 | 27 | 29 | 41 


($\sum x = 393 \quad \sum x^2 = 16483 \quad \sum y = 326 \quad \sum y^2 = 10968 \quad \sum xy = 13014$)
a Calculate, to 3 decimal places, the product moment correlation coefficient between age 


and salary. (3 marks) 
b State two conditions under which it might be appropriate to use Spearman’s rank 
correlation coefficient. (1 mark) 


c Calculate, to 3 decimal places, Spearman’s rank correlation coefficient between age 
and salary. (3 marks) 
It is suggested that there is no correlation between age and salary. 


d Set up appropriate null and alternative hypotheses and carry out an appropriate test to 
evaluate this suggestion. Use a 5% significance level. (4 marks) 


A machine hire company kept records of the age, X months, and the maintenance costs,
£Y, of one type of machine. The following table summarises the data for a random sample of
10 machines.


Machine A B Cc D E F G H I J 
Age, x 63 12 34 81 51 14 45 74 24 89 
Maintenance costs, y 111 25 41 181 64 21 51 145 43 241 
a Calculate, to 3 decimal places, the product moment correlation coefficient. 
(You may use $\sum x^2 = 30625$, $\sum y^2 = 135481$, $\sum xy = 62412$.) (3 marks)
b Calculate, to 3 decimal places, Spearman’s rank correlation coefficient. (3 marks) 


For a different type of machine similar data were collected. From a large population of such 

machines a random sample of 10 was taken and Spearman’s rank correlation coefficient, based 

on $\sum d^2 = 36$, was 0.782.

c Using a 5% level of significance and quoting from the tables of critical values, interpret this 
rank correlation coefficient. Use a two-tailed test and state clearly your null and alternative 
hypotheses. (4 marks) 
14 The data below show the height above sea level, x metres, and the temperature, y °C, at
7.00 a.m., on the same day in summer at nine places in Europe. 


Height, x (m) 1400 | 400 280 790 390 590 540 1250 | 680 
Temperature, y (°C) 6 15 18 10 16 14 13 7 13 


a Use your calculator to find the product moment correlation coefficient for this 
sample. (1 mark) 
b Test, at the 5% significance level, whether height above sea level and temperature are 
negatively correlated. (4 marks) 


On the same day the number of hours of sunshine was recorded and Spearman’s rank
correlation coefficient between hours of sunshine and temperature, based on $\sum d^2 = 28$,
was 0.767.

c Stating your hypotheses and using a 5% two-tailed test, interpret this rank correlation
coefficient. (4 marks)


15 a Explain briefly the conditions under which you would measure association using Spearman’s
rank correlation coefficient rather than the product moment correlation coefficient. (1 mark) 


At an agricultural show 10 Shetland sheep were ranked by a qualified judge and by a trainee 
judge. Their rankings are shown in the table. 


Qualified judge | 1 | 2 | 3 | 4 | 5 | 6 | 7  | 8 | 9 | 10
Trainee judge   | 1 | 2 | 5 | 6 | 7 | 8 | 10 | 4 | 3 | 9
b Calculate Spearman’s rank correlation coefficient for these data. (3 marks) 


c Using a suitable table and a 5% significance level, state your conclusions as to whether there 
is some degree of agreement between the two sets of ranks. (4 marks) 


16 The positions in a league table of 8 rugby clubs at the end of a season are shown, together with
the average attendance (in hundreds) at home matches during the season. 


Club A B Cc D E F G H 
Position 1 2 3 4 5 6 7 8 
Average attendance 30 32 12 19 27 18 15 25 


Calculate Spearman’s rank correlation coefficient between position in the league and home 
attendance. Comment on your results. (4 marks) 


17 The ages, in months, and the weights, in kg, of a random sample of nine babies are shown in 


the table below. 
Baby A B C D E F G vel I 
Age (x) 1 > 2 3 3 3 4 4 5 
Weight () 4.4 5.2 5.8 6.4 6.7 7.2 7.6 7.9 8.4 


a The product moment correlation coefficient between weight and age for these babies was 
found to be 0.972. By testing for positive correlation at the 5% significance level interpret 
this value. (4 marks) 
A boy who does not know the weights or ages of these babies is asked to list them, by
guesswork, in order of increasing weight. He puts them in the order

A C E B G D I F H


b Obtain, to 3 decimal places, a rank correlation coefficient between the boy’s order and the 
true weight order. (3 marks) 


c By carrying out a suitable hypothesis test at the 5% significance level, assess the boy’s ability 
to correctly rank the babies by weight. (4 marks) 


Challenge 


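$x_i$ and $y_i$ are ranked variables with no ties, so that each takes the values 1, 2, 3, ..., $n$ exactly once.
The difference for each pair of data values is defined as $d_i = y_i - x_i$.

a Explain why $\sum x_i^2 = \sum y_i^2 = \sum_{r=1}^{n} r^2$, and hence express the quantity in terms of $n$.

b Explain why $\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2} = \sum (x_i - \bar{x})^2$, and hence show that
$\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2} = \dfrac{n(n^2 - 1)}{12}$

c By expanding $\sum (y_i - x_i)^2$, show that $\sum x_i y_i = \sum x_i^2 - \dfrac{\sum d_i^2}{2}$

d Hence show that $\sum (x_i - \bar{x})(y_i - \bar{y}) = \dfrac{n(n^2 - 1)}{12} - \dfrac{\sum d_i^2}{2}$

e Hence, or otherwise, prove that:
$\dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)}$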


Summary of key points 


1 The product moment correlation coefficient, $r$, is given by $r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$

2 Spearman’s rank correlation coefficient is a special case of the product moment correlation
coefficient in which the data are converted to rankings before calculating the coefficient. To
rank two sets of data, X and Y, you give rank 1 to the highest of the $x_i$ values, 2 to the next
highest and so on. You do the same for the $y_i$ values.


3 A Spearman's rank correlation coefficient of:
+1 means that the rankings are in perfect agreement
-1 means that the rankings are in exact reverse order
0 means that there is no correlation between the rankings.


4 If there are no tied ranks, Spearman’s rank correlation coefficient, $r_s$, is calculated using:

$r_s = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$

where $d$ is the difference between the ranks of each observation,
and $n$ is the number of pairs of observations.

5 Equal data values should be assigned a rank equal to the mean of the tied ranks. 


6 For a one-tailed test, use either:
• $H_0: \rho = 0$, $H_1: \rho > 0$ or
• $H_0: \rho = 0$, $H_1: \rho < 0$
For a two-tailed test, use:
• $H_0: \rho = 0$, $H_1: \rho \neq 0$
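Key points 2 and 5 can be combined in a short Python sketch: rank each data set (rank 1 for the highest value, tied values sharing the mean of the tied ranks) and then apply the product moment formula to the ranks. The function names and the small data set are assumptions made for this illustration only.

```python
from math import sqrt


def mean_ranks(values):
    """Rank 1 goes to the highest value; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pmcc(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical scores containing one tie
scores_a = [7, 9, 7, 4]
scores_b = [5, 8, 6, 3]
print(mean_ranks(scores_a))  # [2.5, 1.0, 2.5, 4.0]
print(pmcc(mean_ranks(scores_a), mean_ranks(scores_b)))
```

When there are no ties, this approach gives the same value as the $1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$ formula.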
After completing this chapter you should be able to:
• Understand and use the probability density function for a
continuous random variable → pages 45-50
• Understand and use the cumulative distribution function
for a continuous random variable → pages 51-56
• Find the mean, variance, mode, median and percentiles of a
continuous random variable and describe the skewness → pages 56-71
• Understand, use and model situations using the continuous
uniform distribution → pages 71-82


1 The discrete random variable X has 
probability function 


k(x = 1) x=2,3,4,5 
PIO 39) = 1 


D =O 


where k is a constant. Find: 
b P(X = 5) 
< FS1, Chapter 1 


sti, 8 A continuous uniform distribution 
can be used to model a random 
variable that is equally likely to 
take any value in a given range. 
If a dart is aimed 
at the centre of a 
dartboard, the 
angle 0 can be Centre 
modelled using 
a continuous 
uniform distribution. 


ro Wa 


a the value of k 
@ leo) 
2 The random variable Y has E(Y) =2 and 

E@4) waking: 

a E(2Y) b Var(Y) 

c Var(4Y - 2) € FS1, Chapter 1 



3 $\int_a^{2a} (3x + 1)\,dx = 168$, where $a$ is a constant.

Find the value of $a$. ← Pure Year 2, Chapter 11
3.1 Continuous random variables


A continuous random variable can take any one of infinitely many values. The probability that a 


continuous random variable takes any one specific value is 0, but you can write the probability that it 
takes values within a given range. 


If ten coins are flipped:

X = number of heads
Probability of getting 4 heads is written as P(X = 4)
X is a discrete random variable

Y = maximum height achieved by a flipped coin
Probability that the maximum height is less than 20 cm is written as P(Y < 20)
Y is a continuous random variable


To describe the probability distribution of a discrete random variable you usually give a table of 
values, or a probability mass function, to define the probability of the random variable taking each 
value in its sample space. 

Because the probability of a continuous random variable taking a specific value is 0, you cannot 


define its distribution in this way. Instead, you can use a probability density function (p.d.f.) to 
define the probability of the random variable taking values within a given range. 


A normally distributed random variable is an example of a continuous random 
variable. This curve shows the probability density function for a normal distribution: 


← SM2, Section 3.1


■ If X is a continuous random variable with probability density function f(x), then
• $f(x) \geq 0$ for all $x \in \mathbb{R}$
• $P(a < X < b) = \int_a^b f(x)\,dx$
• $\int_{-\infty}^{\infty} f(x)\,dx = 1$

In practice, probability density functions are often non-zero on a limited subset of the real numbers.
For the final condition, you only have to integrate the probability density function across all values of
x for which f(x) is non-zero.
Example

In each case, state whether the function could be a probability density function, where $k$ is a
positive constant.

Online: Explore probability density functions using GeoGebra.

a $f(x) = \begin{cases} 2x + 1 & 0 \leq x \leq 3 \\ 0 & \text{otherwise} \end{cases}$

b $f(x) = \begin{cases} k(2 - x) & -2 \leq x \leq 2 \\ 0 & \text{otherwise} \end{cases}$

c $f(x) = \begin{cases} kx(5 - x) & 0 \leq x \leq 6 \\ 0 & \text{otherwise} \end{cases}$

a Area under f(x) = 12 ≠ 1, so f(x) cannot be a p.d.f.

b Area under f(x) = 8k, so f(x) could be a p.d.f. if $k = \frac{1}{8}$.

c For all positive values of $k$, there are values in $0 \leq x \leq 6$ for which f(x) < 0, so f(x) cannot be a
p.d.f.

Watch out: x can take positive or negative values, but f(x) must be non-negative for all possible values of x.
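The two conditions tested in this example (f(x) is never negative, and the total area is 1) can be checked numerically. The following Python sketch uses a simple midpoint rule; the function name `check_pdf`, the step count and the tolerance are assumptions for this illustration, and a numerical check like this supports but does not replace the exact integration.

```python
def check_pdf(f, lower, upper, steps=100_000, tol=1e-6):
    """Midpoint-rule check on [lower, upper], assuming f is zero elsewhere.

    Returns (non_negative, area, is_valid).
    """
    width = (upper - lower) / steps
    values = [f(lower + (i + 0.5) * width) for i in range(steps)]
    area = sum(values) * width
    non_negative = min(values) >= 0
    return non_negative, area, non_negative and abs(area - 1) < tol


# a: f(x) = 2x + 1 on [0, 3] has area 12, so it is not a p.d.f.
print(check_pdf(lambda x: 2 * x + 1, 0, 3))

# b: f(x) = k(2 - x) on [-2, 2] with k = 1/8 satisfies both conditions
print(check_pdf(lambda x: (2 - x) / 8, -2, 2))

# c: f(x) = kx(5 - x) on [0, 6] is negative for x > 5 (k = 1 used here)
print(check_pdf(lambda x: x * (5 - x), 0, 6))
```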
The random variable X has probability density function

$f(x) = \begin{cases} kx(4 - x) & 2 \leq x \leq 4 \\ 0 & \text{otherwise} \end{cases}$

Find:

a the value of $k$ and sketch the p.d.f.


Continuous distributions 


b P(2.5 < X < 3), giving your answer to 3 decimal places. 


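a $\int_2^4 k(4x - x^2)\,dx = 1$

$k\left[2x^2 - \frac{x^3}{3}\right]_2^4 = 1$

$k\left(\left(32 - \frac{64}{3}\right) - \left(8 - \frac{8}{3}\right)\right) = 1$

$k = \frac{3}{16}$

Sketching the graph:
[Sketch: the curve $y = \frac{3}{16}x(4 - x)$ drawn for $2 \leq x \leq 4$]

f(x) = 0 for x < 2 and x > 4. Make sure you only draw your curve between x = 2 and x = 4.

b $P(2.5 < X < 3) = \int_{2.5}^{3} \frac{3}{16}(4x - x^2)\,dx$

$= \frac{3}{16}\left[2x^2 - \frac{x^3}{3}\right]_{2.5}^{3}$

$= \frac{3}{16}\left(9 - \frac{175}{24}\right) = \frac{41}{128}$

$= 0.320$ (3 d.p.)

Watch out: Your answer is a probability. Make sure it is between 0 and 1.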
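The exact values in this example can be confirmed with Python's `fractions` module. In this sketch, `integrate_poly` is a hypothetical helper that integrates a polynomial given as a list of coefficients (constant term first); it is an illustration rather than a general-purpose tool.

```python
from fractions import Fraction


def integrate_poly(coeffs, a, b):
    """Exact integral over [a, b] of sum(coeffs[i] * x**i)."""
    def antiderivative(x):
        return sum(Fraction(c) * x ** (i + 1) / (i + 1)
                   for i, c in enumerate(coeffs))
    return antiderivative(Fraction(b)) - antiderivative(Fraction(a))


shape = [0, 4, -1]  # x(4 - x) = 4x - x^2

# The total area must be 1, so k is the reciprocal of the area under x(4 - x)
k = 1 / integrate_poly(shape, 2, 4)

# P(2.5 < X < 3)
p = k * integrate_poly(shape, Fraction(5, 2), 3)

print(k, p, float(p))  # 3/16 41/128 0.3203125
```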
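Example

The random variable X has probability density function

$f(x) = \begin{cases} k & 1 < x < 2 \\ k(x - 1) & 2 \leq x \leq 4 \\ 0 & \text{otherwise} \end{cases}$

a Sketch f(x). b Find the value of k. c Find P(X > 3).

a [Sketch: f(x) = k for 1 < x < 2, then a straight line rising from k at x = 2 to 3k at x = 4; f(x) = 0 elsewhere]

b $\int_1^2 k\,dx + \int_2^4 k(x - 1)\,dx = 1$

$[kx]_1^2 + \left[\frac{kx^2}{2} - kx\right]_2^4 = 1$

$k + ((8k - 4k) - (2k - 2k)) = 1$

$5k = 1$

$k = \frac{1}{5}$

c $P(X > 3) = \int_3^4 \frac{1}{5}(x - 1)\,dx = \frac{1}{5}\left[\frac{x^2}{2} - x\right]_3^4 = \frac{1}{5}\left(4 - \frac{3}{2}\right) = \frac{1}{2}$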


Exercise 3A


1 Give reasons why the following are not valid probability density functions. 


1 
a fa) =| c fa) =| 


qx -l=x<2 x? Llex=3 


x3-2 -l<x<3 
bo) -{ 0 otherwise 


0 otherwise 0 otherwise 


2 For what value of $k$ is the following a valid probability density function?

$f(x) = \begin{cases} k(x^2 - 1) & -4 \leq x \leq -2 \\ 0 & \text{otherwise} \end{cases}$


3 Sketch the following probability density functions. 


a toy= {> 2<x<6 b ft ={ 
0 otherwise 


+ (5- x) l<xx<4 


0 otherwise 


4 Find the value of & so that each of the following is a valid probability density function. 
kx 1sxs3 lor VSx= 3 _ [kl +s") -lSx=2 
a) = 0 otherwise ba 0 otherwise ooo 0 otherwise 
The continuous random variable X has probability density function given by


ny = {OO Vea? 
0 otherwise 
a Find the value of k. 
b Sketch the probability density function for all values of x. 


c Find P(X > 1). 


The continuous random variable X has probability density function given by

$f(x) = \begin{cases} kx^2(2 - x) & 0 \leq x \leq 2 \\ 0 & \text{otherwise} \end{cases}$

a Find the value of k. (3 marks)
b Find P(0 < X < 1). (3 marks)


The continuous random variable X has probability density function given by

$f(x) = \begin{cases} kx^3 & 1 \leq x \leq 4 \\ 0 & \text{otherwise} \end{cases}$

a Find the value of k. (3 marks)
b Find P(1 < X < 2). (3 marks)


The continuous random variable X has probability density function given by

$f(x) = \begin{cases} k & 0 \leq x < 2 \\ k(2x - 3) & 2 \leq x \leq 3 \\ 0 & \text{otherwise} \end{cases}$

a Find the value of k. (2 marks)
b Sketch the probability density function for all values of x. (2 marks)


A different continuous random variable Y has probability density function given by 


3 
f(y) = — — - : 
0 otherwise 
c Given that X and Y are independent, find the probability that X and Y are both less than 1.
(4 marks)


The length of time visitors spent on a news website, X minutes, is modelled using the
probability density function


fay = [we 0<x<10 
0 otherwise 
a Use this model to find the probability that a randomly chosen visitor spends less than 
30 seconds on the website. (3 marks) 
b Sketch the probability density function. (2 marks) 


In reality, a small number of customers are found to spend more than 10 minutes on the website. 
c Sketch a probability density function that might provide a better model for X. (1 mark) 
Chapter 3


The continuous random variable X has probability density function 


k 
0 otherwise 


a Find the value of k. (3 marks) 
b Find P2 < X < 4), giving your answer in the form me where a and / are integers to 
: Ind 
be determined. (3 marks) 


The continuous random variable X has probability density function 


k 
f(x) =4 x42 -Il<x=<4 
0 otherwise 
a Find the value of k. (3 marks) 
b Find P(| < X < 3), giving your answer to 3 decimal places. (3 marks) 


The continuous random variable X has probability density function 


ray = OSS 1 

0 otherwise 
a Find the value of k. (3 marks) 
b Sketch the probability density function for all values of x. (2 marks) 
¢ Find PO < X¥ <2). (3 marks) 


Challenge Problem-solving 


The length, T minutes, of a telephone call to a customer There is no upper limit on the 
service department has probability density function, sample space of 7: You will need 
ie to use an improper integral to 
ee find the area under the p.d/f. 


0 otherwise 


a Find the value of k. 
b Find the probability that a call will be: 


c 
i less than 3 minutes ii more than 20 minutes.
Given that P(p < T < 2p) = 0.12, find the value of p. 


3.2 The cumulative distribution function


Calculating probabilities from the p.d.f. can be time consuming, and requires integration. You can save
time by finding the cumulative distribution function (c.d.f.) of a random variable.


■ For a random variable X, the cumulative distribution function is $F(x) = P(X \le x)$.

Notation: The p.d.f. is written with a lower case f, whereas the c.d.f. is written with a capital F.

This is equivalent to the area under the p.d.f. to the left of x.

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$

[Figure: sketch of the p.d.f.; the shaded area to the left of x is F(x).]

You use t rather than x as the variable in the integration to avoid confusion with x
as the limit of the integration.


■ If X is a continuous random variable with c.d.f. F(x) and p.d.f. f(x):

$$f(x) = \frac{d}{dx}F(x) \quad \text{and} \quad F(x) = \int_{-\infty}^{x} f(t)\,dt$$

Note: Integrate f(x) to obtain F(x); differentiate F(x) to obtain f(x).


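The random variable X has probability density function

$$f(x) = \begin{cases} \tfrac{1}{4}x & 1 \le x \le 3 \\ 0 & \text{otherwise} \end{cases}$$

Find F(x).

Online: Explore cumulative distribution functions using GeoGebra.

Method 1

$$F(x) = \int_1^x \tfrac{1}{4}t\,dt = \left[\tfrac{1}{8}t^2\right]_1^x = \tfrac{1}{8}x^2 - \tfrac{1}{8}$$

Problem-solving: The p.d.f. is 0 for all values of x < 1, so you can
use 1 as the lower limit for the integration.

Method 2

$$F(x) = \int \tfrac{1}{4}x\,dx = \tfrac{1}{8}x^2 + c$$

$F(3) = 1$, so $\tfrac{9}{8} + c = 1$ and $c = -\tfrac{1}{8}$.

$$F(x) = \begin{cases} 0 & x < 1 \\ \tfrac{1}{8}x^2 - \tfrac{1}{8} & 1 \le x \le 3 \\ 1 & x > 3 \end{cases}$$
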
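Example

The random variable X has probability density function

$$f(x) = \begin{cases} \tfrac{1}{5} & 1 \le x \le 2 \\ \tfrac{1}{5}(x-1) & 2 < x \le 4 \\ 0 & \text{otherwise} \end{cases}$$

Specify fully the cumulative distribution function of X.

Method 1

If $x < 1$: $F(x) = 0$, so $F(1) = 0$.

If $1 \le x \le 2$:

$$F(x) = F(1) + \int_1^x \tfrac{1}{5}\,dt = \left[\tfrac{1}{5}t\right]_1^x = \tfrac{1}{5}x - \tfrac{1}{5}$$

so $F(2) = \tfrac{1}{5}$.

If $2 < x \le 4$:

$$F(x) = F(2) + \int_2^x \tfrac{1}{5}(t-1)\,dt = \tfrac{1}{5} + \left[\tfrac{1}{10}t^2 - \tfrac{1}{5}t\right]_2^x = \tfrac{1}{5} + \left(\tfrac{1}{10}x^2 - \tfrac{1}{5}x\right) - \left(\tfrac{4}{10} - \tfrac{2}{5}\right) = \tfrac{1}{10}x^2 - \tfrac{1}{5}x + \tfrac{1}{5}$$

Watch out: Integrate each section of the p.d.f. separately. The c.d.f. is cumulative so for each
section, you need to add on the value of the c.d.f. at the upper limit of the previous section.

Method 2

If $1 \le x \le 2$:

$$F(x) = \int \tfrac{1}{5}\,dx = \tfrac{1}{5}x + c$$

$F(1) = 0$, so $c = -\tfrac{1}{5}$.

If $2 < x \le 4$:

$$F(x) = \int \tfrac{1}{5}(x-1)\,dx = \tfrac{1}{10}x^2 - \tfrac{1}{5}x + d$$

$F(4) = 1$, so $\tfrac{16}{10} - \tfrac{4}{5} + d = 1$ and $d = \tfrac{1}{5}$.

$$F(x) = \begin{cases} 0 & x < 1 \\ \tfrac{1}{5}x - \tfrac{1}{5} & 1 \le x \le 2 \\ \tfrac{1}{10}x^2 - \tfrac{1}{5}x + \tfrac{1}{5} & 2 < x \le 4 \\ 1 & x > 4 \end{cases}$$

Watch out: Remember to write the cumulative distribution in full. This means you need to define
F(x) for all values of $x \in \mathbb{R}$.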
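
The piecewise method above can be checked numerically. The sketch below is only an illustration: it assumes the p.d.f. is 1/5 on [1, 2] and (x − 1)/5 on (2, 4], and the function names `pdf`, `cdf_numeric` and `cdf_exact` are hypothetical helpers, not part of the course material. It compares the area under the p.d.f. with the closed-form c.d.f. found by integrating each section.

```python
def pdf(x):
    """Assumed p.d.f.: 1/5 on [1, 2], (x - 1)/5 on (2, 4], 0 elsewhere."""
    if 1 <= x <= 2:
        return 1 / 5
    if 2 < x <= 4:
        return (x - 1) / 5
    return 0.0


def cdf_numeric(x, steps=20_000):
    """Approximate F(x) as the area under the p.d.f. from 1 to x (midpoint rule)."""
    if x <= 1:
        return 0.0
    upper = min(x, 4)
    h = (upper - 1) / steps
    return sum(pdf(1 + (i + 0.5) * h) for i in range(steps)) * h


def cdf_exact(x):
    """Closed form from integrating each section and adding on F(2) = 1/5."""
    if x < 1:
        return 0.0
    if x <= 2:
        return x / 5 - 1 / 5
    if x <= 4:
        return x**2 / 10 - x / 5 + 1 / 5
    return 1.0


# Compare the two versions at a few points, including the section boundaries.
for x in (0.5, 1.5, 2, 3, 4, 5):
    print(x, round(cdf_numeric(x), 4), round(cdf_exact(x), 4))
```

The two columns should agree to the displayed accuracy, which is a quick way to catch a forgotten constant when joining sections.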


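Example

The random variable X has cumulative distribution function

$$F(x) = \begin{cases} 0 & x < 0 \\ \tfrac{1}{5}x + \tfrac{3}{20}x^2 & 0 \le x \le 2 \\ 1 & x > 2 \end{cases}$$

Find:
a $P(X \le 1.5)$ b $P(0.5 \le X \le 1.5)$
c $P(X = 1)$ d the probability density function, f(x).

a $P(X \le 1.5) = F(1.5) = \tfrac{1}{5} \times 1.5 + \tfrac{3}{20} \times 1.5^2 = 0.6375$

b $P(0.5 \le X \le 1.5) = F(1.5) - F(0.5) = 0.6375 - 0.1375 = 0.5$

Here $P(0.5 \le X \le 1.5) = P(X \le 1.5) - P(X \le 0.5)$.

c $P(X = 1) = 0$, since X is a continuous random variable.

d $f(x) = \dfrac{d}{dx}F(x) = \dfrac{d}{dx}\left(\tfrac{1}{5}x + \tfrac{3}{20}x^2\right) = \tfrac{1}{5} + \tfrac{3}{10}x$

$$f(x) = \begin{cases} \tfrac{1}{5} + \tfrac{3}{10}x & 0 \le x \le 2 \\ 0 & \text{otherwise} \end{cases}$$

Watch out: If F(x) is constant on a given interval, then f(x) = 0 on that interval. For x > 2,
$F(x) = 1$ and $\dfrac{d}{dx}(1) = 0$.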
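
As a rough check on this kind of question, the sketch below evaluates probabilities directly from a c.d.f. and recovers the p.d.f. by numerical differentiation. It assumes the c.d.f. F(x) = x/5 + 3x²/20 on [0, 2]; the names `cdf` and `pdf_from_cdf` are illustrative only.

```python
def cdf(x):
    """Assumed c.d.f.: F(x) = x/5 + 3x^2/20 on [0, 2], 0 below and 1 above."""
    if x < 0:
        return 0.0
    if x > 2:
        return 1.0
    return x / 5 + 3 * x**2 / 20


def pdf_from_cdf(x, h=1e-6):
    """Central-difference estimate of f(x) = dF/dx."""
    return (cdf(x + h) - cdf(x - h)) / (2 * h)


p_a = cdf(1.5)             # P(X <= 1.5)
p_b = cdf(1.5) - cdf(0.5)  # P(0.5 <= X <= 1.5)

print(p_a, p_b)
print(pdf_from_cdf(1.0))   # compare with 1/5 + 3/10 * 1
print(pdf_from_cdf(3.0))   # F is constant for x > 2, so the slope is 0
```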
Exercise 3B


1 The continuous random variable X has probability density function given by


ray ={ 0<x<2 
0 otherwise 


Find F(x). 


2 The continuous random variable X has probability density function given by


fay =f 409 1=¢=3 
0 otherwise 


Find F(x). 


3 The continuous random variable X has probability density function given by

$$f(x) = \begin{cases} \tfrac{1}{9}x & 0 \le x < 3 \\ \tfrac{1}{9}(6-x) & 3 \le x \le 6 \\ 0 & \text{otherwise} \end{cases}$$

Define fully the cumulative distribution function of X. (6 marks)

4 The continuous random variable X has probability density function given by

$$f(x) = \begin{cases} k & 0 \le x < 3 \\ k(2x-5) & 3 \le x \le 5 \\ 0 & \text{otherwise} \end{cases}$$

a Sketch f(x). b Find the value of k. c Find F(x).

5 The continuous random variable X has cumulative distribution function given by

$$F(x) = \begin{cases} 0 & x < 2 \\ \tfrac{1}{5}(x^2-4) & 2 \le x \le 3 \\ 1 & x > 3 \end{cases}$$

Find the probability density function, f(x). (3 marks)

6 The continuous random variable X has cumulative distribution function given by

$$F(x) = \begin{cases} 0 & x < 1 \\ \tfrac{1}{2}(x-1) & 1 \le x \le 3 \\ 1 & x > 3 \end{cases}$$

Find: 
a $P(X \le 2.5)$ b $P(X > 1.5)$ c $P(1.5 \le X \le 2.5)$


7 


€/P) 10 


Ep) 11 


Continuous distributions 


The continuous random variable _X has cumulative Problem-solving 
distribution function 
Use the fact that F(2) = 0 and F(4) =1 


0 x2 to form two simultaneous equations 
F(x) = Ex? +q a ae in pand q. 
1 x>4 
Find the exact values of p and g, showing your working clearly. (7 marks) 


8 The continuous random variable X has cumulative distribution function given by

$$F(x) = \begin{cases} 0 & x < 1 \\ \tfrac{1}{2}(x^3 - 2x^2 + x) & 1 \le x \le 2 \\ 1 & x > 2 \end{cases}$$

a Find the probability density function f(x). (3 marks)
b Sketch the probability density function. (2 marks)
c Find $P(X < 1.5)$. (1 mark)


9 The continuous random variable X has probability density function given by

$$f(x) = \begin{cases} k(4 - x^2) & 0 \le x \le 2 \\ 0 & \text{otherwise} \end{cases}$$

a Show that $k = \tfrac{3}{16}$ (3 marks)
b Find the cumulative distribution function of X. (3 marks)
c Find $P(0.69 < X < 0.70)$. Give your answer correct to one significant figure. (2 marks)


A student attempts to define a random variable X by means of the following 
cumulative distribution function: 


Nie got 
Without using calculus, explain why this is not a valid cumulative distribution function. 
(2 marks) 
The continuous random variable Y has cumulative distribution function 
0 x <0 
F(x) =} Gy(kx-3)  0<x <3 
1 x> 3 
where k is a constant. 
Find: 
a the value of k (2 marks) 
b P(Y > 2) (2 marks) 




(E) 12 The continuous random variable X has probability density function f(x) given by

$$f(x) = \begin{cases} \dfrac{1}{x \ln 7} & 1 \le x \le 7 \\ 0 & \text{otherwise} \end{cases}$$

Find F(x). (3 marks)


(E/P) 13 The continuous random variable X¥ has probability density function given by 


f(x) = a 0<x<4 


0 otherwise 
Find F(x). (3 marks) 
(E/P) 14 The continuous random variable X has cumulative distribution function F(x) given by

$$F(x) = \begin{cases} 0 & x < 1 \\ k(x - 1 + \ln x) & 1 \le x \le 3 \\ 1 & x > 3 \end{cases}$$

Find:
a the exact value of k (3 marks)
b the probability density function f(x). (3 marks)


Challenge 


The lifetime, in years, of a light bulb is modelled by the random variable 
T with probability density function 


-1.25¢ = 
al et 
Find: 
a anexpression for F(t) 
b the probability that a light bulb lasts for between 1 and 2 years 
c the probability that a light bulb lasts for more than 3 years. 


3.3 Mean and variance of a continuous distribution


You can extend the ideas of mean and variance of a random variable to continuous random variables.


■ If X is a continuous random variable with probability density function f(x):
• the mean or expected value of X is given by

$$E(X) = \mu = \int_{-\infty}^{\infty} x f(x)\,dx$$

• the variance of X is given by

$$\mathrm{Var}(X) = \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2$$

Links: These definitions correspond to the mean and variance of a discrete random variable,
with $\sum$ replaced with $\int$ and the probability mass function replaced with the p.d.f. ← FS1, Chapter 1


These definitions will be given in the formulae booklet in your exam.
You can also find the mean of a function of a continuous random variable in a similar way as
with discrete random variables:

■ If X is a continuous random variable, then $E(g(X)) = \displaystyle\int_{-\infty}^{\infty} g(x) f(x)\,dx$

This gives the following convenient way to calculate Var(X):
■ $\mathrm{Var}(X) = E(X^2) - (E(X))^2$

In the case where g(X) is a linear function of the form aX + b, it is useful to learn the following
results:
• $E(aX + b) = aE(X) + b$
• $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$

Links: These results are the same as those for discrete random variables. ← FS1, Section 1.3
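
The definitions above can be tried out numerically. The following is a minimal sketch, not a prescribed method: the p.d.f. f(x) = x/4 on [1, 3] is an assumed illustration, and `integrate` and `mean_and_variance` are hypothetical helper names. It computes E(X) and Var(X) from the integrals and checks the linear result E(2X − 3) = 2E(X) − 3.

```python
def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h


def mean_and_variance(pdf, a, b):
    """Return (E(X), Var(X)) for a p.d.f. that is zero outside [a, b]."""
    mean = integrate(lambda x: x * pdf(x), a, b)
    mean_of_square = integrate(lambda x: x * x * pdf(x), a, b)
    return mean, mean_of_square - mean**2


def pdf(x):
    """Illustrative p.d.f. (an assumption for this sketch): f(x) = x/4 on [1, 3]."""
    return x / 4


mean, var = mean_and_variance(pdf, 1, 3)

# E(2X - 3) computed directly from E(g(X)) = integral of g(x) f(x) dx.
mean_linear = integrate(lambda x: (2 * x - 3) * pdf(x), 1, 3)

print(mean, var)                  # close to 13/6 and 11/36
print(mean_linear, 2 * mean - 3)  # the two values should agree
```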
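Example

A random variable Y has probability density function

$$f(y) = \begin{cases} \tfrac{1}{4}y & 1 \le y \le 3 \\ 0 & \text{otherwise} \end{cases}$$

Find:
a E(Y) b Var(Y) c E(2Y − 3) d Var(2Y − 3)

a $E(Y) = \displaystyle\int_1^3 y f(y)\,dy = \int_1^3 \tfrac{1}{4}y^2\,dy = \left[\tfrac{1}{12}y^3\right]_1^3 = \tfrac{27}{12} - \tfrac{1}{12} = \tfrac{13}{6}$

b $E(Y^2) = \displaystyle\int_1^3 \tfrac{1}{4}y^3\,dy = \left[\tfrac{1}{16}y^4\right]_1^3 = \tfrac{81}{16} - \tfrac{1}{16} = 5$

$\mathrm{Var}(Y) = E(Y^2) - (E(Y))^2 = 5 - \left(\tfrac{13}{6}\right)^2 = \tfrac{11}{36}$

c $E(2Y - 3) = 2E(Y) - 3 = 2 \times \tfrac{13}{6} - 3 = \tfrac{4}{3}$

d $\mathrm{Var}(2Y - 3) = 4\,\mathrm{Var}(Y) = 4 \times \tfrac{11}{36} = \tfrac{11}{9}$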


A random variable X has probability density function 
3 
fe = [ROB A2x= 3 
otherwise 
a Sketch the probability density function. 


b Find E(X). 


a The sketch of the p.d.f. is

[Figure: sketch of f(x) against x]

Problem-solving: If the p.d.f. of a continuous random variable, X,
is symmetric about some point x = a, then E(X) = a.

b By symmetry E(X) = 1


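Example 9

The random variable X has probability density function

$$f(x) = \begin{cases} \tfrac{2}{15}x & 0 \le x < 3 \\ \tfrac{1}{5}(5 - x) & 3 \le x \le 5 \\ 0 & \text{otherwise} \end{cases}$$

Find:
a E(X) b Var(X)

a $E(X) = \displaystyle\int_0^3 \tfrac{2}{15}x^2\,dx + \int_3^5 \tfrac{1}{5}(5x - x^2)\,dx = \left[\tfrac{2}{45}x^3\right]_0^3 + \left[\tfrac{1}{2}x^2 - \tfrac{1}{15}x^3\right]_3^5$

$= \left(\tfrac{6}{5} - 0\right) + \left(\left(\tfrac{25}{2} - \tfrac{25}{3}\right) - \left(\tfrac{9}{2} - \tfrac{9}{5}\right)\right) = \tfrac{8}{3}$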
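In a block-building video game, a cube of side length X pixels is randomly generated.
X is a continuous random variable with probability density function

$$f(x) = \begin{cases} \tfrac{1}{800}(x + 10) & 20 \le x \le 40 \\ 0 & \text{otherwise} \end{cases}$$

a Find: i E(X) ii E(X²) iii Var(X)
b Find the expected volume of the cube.

a i $E(X) = \dfrac{1}{800}\displaystyle\int_{20}^{40} x(x + 10)\,dx = \dfrac{1}{800}\left[\tfrac{1}{3}x^3 + 5x^2\right]_{20}^{40} = \dfrac{1}{800}\left(\dfrac{88\,000}{3} - \dfrac{14\,000}{3}\right) = \dfrac{185}{6}$

ii $E(X^2) = \dfrac{1}{800}\displaystyle\int_{20}^{40} x^2(x + 10)\,dx = \dfrac{1}{800}\left[\tfrac{1}{4}x^4 + \tfrac{10}{3}x^3\right]_{20}^{40} = \dfrac{1}{800}\left(\dfrac{2\,560\,000}{3} - \dfrac{200\,000}{3}\right) = \dfrac{2950}{3}$

iii $\mathrm{Var}(X) = \dfrac{2950}{3} - \left(\dfrac{185}{6}\right)^2 = \dfrac{1175}{36}$

b $E(X^3) = \dfrac{1}{800}\displaystyle\int_{20}^{40} x^3(x + 10)\,dx = \dfrac{1}{800}\left[\tfrac{1}{5}x^5 + \tfrac{5}{2}x^4\right]_{20}^{40} = \dfrac{1}{800}(26\,880\,000 - 1\,040\,000) = 32\,300$

Watch out: The expected volume is $E(X^3)$. This is not the same as $(E(X))^3$. You need to use
the rule $E(g(X)) = \displaystyle\int g(x) f(x)\,dx$.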
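
The arithmetic in this example can be reproduced exactly with rational numbers. The sketch below assumes the p.d.f. f(x) = (x + 10)/800 on [20, 40]; the helper name `moment` is hypothetical. It integrates the polynomial term by term and also shows that (E(X))³ differs from E(X³).

```python
from fractions import Fraction


def moment(n, lo=20, hi=40):
    """Exact E(X^n) for the assumed p.d.f. f(x) = (x + 10)/800 on [lo, hi].

    Integrates x^n (x + 10) term by term, giving
    x^(n+2)/(n+2) + 10 x^(n+1)/(n+1), evaluated between the limits.
    """
    def antiderivative(x):
        x = Fraction(x)
        return x ** (n + 2) / (n + 2) + 10 * x ** (n + 1) / (n + 1)

    return (antiderivative(hi) - antiderivative(lo)) / 800


total_probability = moment(0)   # should be 1 for a valid p.d.f.
mean = moment(1)
mean_square = moment(2)
variance = mean_square - mean**2
expected_volume = moment(3)

print(total_probability, mean, mean_square, variance, expected_volume)
print(float(mean**3))           # (E(X))^3, which is smaller than E(X^3)
```

Working in `Fraction` avoids rounding, so the printed values can be compared directly with the fractions in the worked solution.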


Exercise 3C


1 The continuous random variable X has a probability density function given by
F a 0<x<2 
= 0 otherwise 


Find: 
ak b E(X) ec Var(X) 


2 The continuous random variable Y has a probability density function given by

$$f(y) = \begin{cases} \dfrac{y^2}{9} & 0 \le y \le 3 \\ 0 & \text{otherwise} \end{cases}$$

Find:
a E(Y) b Var(Y) c the standard deviation of Y.

3 The continuous random variable Y has a probability density function given by

$$f(y) = \begin{cases} \dfrac{y}{8} & 0 \le y \le 4 \\ 0 & \text{otherwise} \end{cases}$$

Find:
a E(Y)
b Var(Y)
c the standard deviation of Y
d $P(Y > \mu)$
e Var(3Y + 2)
f E(Y + 2)

Hint: $\mu = E(Y)$. In part d, use your answer from part a.


4 The continuous random variable X has a probability density function given by

$$f(x) = \begin{cases} k(1 - x) & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

a Find k. (3 marks)
b Find E(X). (3 marks)
c Show that $\mathrm{Var}(X) = \tfrac{1}{18}$ (2 marks)
d Find $P(X > \mu)$. (3 marks)

5 The continuous random variable X has a probability density function given by

$$f(x) = \begin{cases} 12x^2(1 - x) & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Find:
a $P(X < 0.5)$ b E(X)

6 The continuous random variable X has a probability density function given by

$$f(x) = \begin{cases} \tfrac{3}{8}(1 + x^2) & -1 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

a Sketch the probability density function of X. (2 marks)
b Write down E(X). (1 mark)
c Show that $\sigma^2 = 0.4$. (3 marks)
d Find $P(-\sigma < X < \sigma)$. (3 marks)

Problem-solving: Use symmetry to answer part b.

7 The continuous random variable T has a probability density function given by


‘ a 0<t<2 
= 0 otherwise 


where k is a positive constant. 


a Find k. b Show that E(T) is 1.6. 
Find: 
c E(2T + 3) d Var(T) e Var(2T + 3) f $P(T < 1)$


The continuous random variable XY has a probability density function given by 


x 0<x<3 
21 
f(x)=4 1 g27<5 
0 otherwise 
a Draw a sketch of f(x). 
Find: 


b E(X) ec Var(X) d the standard deviation, o, of X. 


The continuous random variable XY has a probability density function given by 
a(x-1l) 1<x<2 
f(x)={ZS-x) 2<x<5 
0 otherwise 
a Sketch f(x). 
Find: 
b E(X) 
ce Var(X) 


Continuous distributions 


(2 marks) 


(5 marks) 
(4 marks) 


10 Telephone calls arriving at a company are referred immediately by the telephonist to other
people working in the company. The time a call takes, in minutes, is modelled by a continuous
random variable T, having a probability density function given by

$$f(t) = \begin{cases} kt^2 & 0 \le t \le 10 \\ 0 & \text{otherwise} \end{cases}$$

a Show that k = 0.003. (3 marks)
Find:
b E(T) (3 marks)
c Var(T) (2 marks)
d the probability of a call lasting between 7 and 9 minutes. (3 marks)
e Sketch the probability density function. (2 marks)
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11 A ccontinuous random variable X has probability density function 


roy =f 0<x<2 


0 otherwise 
Find: 
a E(X) (3 marks) Problem-solving 
b E(x’) (3 marks) é 
= va 
© Var(X) (2 marks) E(x) = |x?) dx 
12 The random variable X has cumulative distribution function 
0 x<0 
x2 
F(x) = 100 0 Sx 10 
1 x> 10 
a Find Var(X). (4 marks) 
b Show that E(X?) = 400 (3 marks) 


i913 A continuous random variable X has a probability density function given by 


(E/P) rove {s 1<x<3 


x 
0 otherwise 
Find: 
a the value of k (3 marks) 
b E(X) (3 marks) 
ec Var(X) (3 marks) 
(E/P) 14 A continuous random variable ¥ has a probability density function given by 
“ l<x<2 
f(x) = 4 x3 - x) 
0 otherwise 
3 
a Show that c= ind (3 marks) 
b Calculate the mean and variance of X. (6 marks) 


(E/P) 15 A continuous random variable ¥ has probability density function 


fxs 2x vey = 1 
~ ) 0 otherwise 
Find Edin x). (5 marks) 
Challenge

Given that f(x) is the probability density function of a continuous
random variable X, prove, from the following definitions, that

$$\mathrm{Var}(X) = E(X^2) - (E(X))^2$$

$$E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$$

$$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$

Watch out: Only use the definitions given in the question in your proof.
3.4 Mode, median, percentiles and skewness


You need to be able to find the mode of a continuous random variable. 


= The mode of a continuous random variable is the value of x for which the p.d.f. is a 
maximum. 


This is the value of x for which the probability distribution is ‘most dense’. A random variable can have 
more than one modal value, though you will usually only be asked to find the mode in cases where 
the probability density function has a unique maximum value. 


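Example

The random variables X and Y have probability density functions f(x) and g(y) respectively.

$$f(x) = \begin{cases} 12x^2(1 - x) & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad g(y) = \begin{cases} 2y & 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Find the mode of:
a X b Y

a [Figure: sketch of f(x) for 0 ≤ x ≤ 1, with a single maximum]

$\dfrac{d}{dx}(12x^2 - 12x^3) = 0$
$24x - 36x^2 = 0$
$12x(2 - 3x) = 0$
$x = \tfrac{2}{3}$ (at $x = 0$, $f(x) = 0$, which is a minimum)
Mode = $\tfrac{2}{3}$

b [Figure: sketch of g(y), a straight line rising from 0 at y = 0 to 2 at y = 1]

Mode = 1

Watch out: The mode does not need to occur at or even near the 'middle' of a probability
distribution.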


You can use the cumulative distribution function to define measures of location for a continuous
random variable.

■ If X is a continuous random variable with c.d.f. F(x):
• the median of X is the value m such that $F(m) = 0.5$
• the lower quartile of X is the value $Q_1$ such that $F(Q_1) = 0.25$
• the upper quartile of X is the value $Q_3$ such that $F(Q_3) = 0.75$
• the nth percentile of X is the value $P_n$ such that $F(P_n) = \dfrac{n}{100}$


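Example

A continuous random variable X has probability density function

$$f(x) = \begin{cases} 4x - 4x^3 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Find:
a the c.d.f. of X
b the median value of X.

a Method 1

$$F(x) = \int_0^x (4t - 4t^3)\,dt = \left[2t^2 - t^4\right]_0^x = 2x^2 - x^4$$

Method 2

$$F(x) = \int (4x - 4x^3)\,dx = 2x^2 - x^4 + c$$

$F(0) = 0$, so $c = 0$.

$$F(x) = \begin{cases} 0 & x < 0 \\ 2x^2 - x^4 & 0 \le x \le 1 \\ 1 & x > 1 \end{cases}$$

b $2m^2 - m^4 = 0.5$
$2m^4 - 4m^2 + 1 = 0$
$m^2 = \dfrac{4 \pm \sqrt{8}}{4} = 1 \pm \dfrac{\sqrt{2}}{2}$
$m = \sqrt{1 \pm \dfrac{\sqrt{2}}{2}}$
$m = 1.31$ or $0.541$ (3 s.f.)
Since the median must lie in the interval $0 \le x \le 1$, median = 0.541 (3 s.f.)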
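
When the equation F(m) = 0.5 cannot be solved by hand, the same definition can be applied numerically. The sketch below is an illustration only: it assumes the c.d.f. F(x) = 2x² − x⁴ on [0, 1], and `quantile` is a hypothetical helper that solves F(x) = p by bisection.

```python
def cdf(x):
    """Assumed c.d.f. from the worked example: F(x) = 2x^2 - x^4 on [0, 1]."""
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return 2 * x**2 - x**4


def quantile(p, lo=0.0, hi=1.0, tol=1e-10):
    """Solve F(x) = p by bisection; valid because F is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


median = quantile(0.5)
lower_quartile = quantile(0.25)
upper_quartile = quantile(0.75)

print(round(median, 3))
print(round(lower_quartile, 3), round(upper_quartile, 3))
```

Bisection only searches inside the interval where the p.d.f. is non-zero, so it never returns a spurious root such as 1.31.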
Note: The median is also sometimes written as $Q_2$. For a symmetrical distribution,
the median is equal to the mean.


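Example 13

A continuous random variable X has the cumulative distribution function

$$F(x) = \begin{cases} 0 & x < 1 \\ \tfrac{1}{5}x - \tfrac{1}{5} & 1 \le x \le 2 \\ \tfrac{1}{10}x^2 - \tfrac{1}{5}x + \tfrac{1}{5} & 2 < x \le 4 \\ 1 & x > 4 \end{cases}$$

Find:
a the interquartile range
b the 10th percentile.

Problem-solving: If a c.d.f. is defined piecewise it is a good idea to
find the boundary values for each section of F(x).
This will tell you which section of the function
to use when calculating the median, quartiles or
percentiles.

a $F(2) = \tfrac{1}{5}(2) - \tfrac{1}{5} = 0.2$

Lower quartile: $0.25 > F(2)$, so $Q_1$ lies in the section $2 < x \le 4$.

$\dfrac{Q_1^2}{10} - \dfrac{Q_1}{5} + \dfrac{1}{5} = 0.25$
$Q_1^2 - 2Q_1 + 2 = 2.5$
$Q_1^2 - 2Q_1 - 0.5 = 0$
$Q_1 = \dfrac{2 \pm \sqrt{4 + 2}}{2}$
$Q_1 = 2.22$ or $-0.225$ (3 s.f.)
$Q_1 = 2.22$ (3 s.f.)

Upper quartile

$\dfrac{Q_3^2}{10} - \dfrac{Q_3}{5} + \dfrac{1}{5} = 0.75$
$Q_3^2 - 2Q_3 + 2 = 7.5$
$Q_3^2 - 2Q_3 - 5.5 = 0$
$Q_3 = \dfrac{2 \pm \sqrt{4 + 22}}{2}$
$Q_3 = 3.55$ or $-1.55$ (3 s.f.)
$Q_3 = 3.55$ (3 s.f.)

Interquartile range = 3.55 − 2.22
= 1.33 (3 s.f.)

b $0.1 < F(2)$, so $P_{10}$ lies in the section $1 \le x \le 2$.

$F(P_{10}) = \dfrac{P_{10}}{5} - \dfrac{1}{5} = 0.1$
$P_{10} = 1.5$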
You can use the concept of skewness to describe the symmetry of a distribution, or lack thereof.
These probability density functions show examples of different types of skewness.

[Figure: three sketches of f(x) against x, described below]

This distribution has a longer 'tail' at its right-hand end, and the mass of the
distribution is concentrated at the left-hand end. It is positively skewed.

If a distribution is symmetrical it can be described as being unskewed or having no skew.

This distribution has a longer 'tail' at its left-hand end, and the mass of the
distribution is concentrated at the right-hand end. It is negatively skewed.

In many cases, you can use the following rules to determine whether a distribution is positively or
negatively skewed.

■ Positive skew: mode < median < mean
■ Negative skew: mean < median < mode


Example 14

A continuous random variable X has probability density function

$$f(x) = \begin{cases} \tfrac{1}{10}x & 0 \le x < 2 \\ \tfrac{1}{4} - \tfrac{1}{40}x & 2 \le x \le 10 \\ 0 & \text{otherwise} \end{cases}$$

a Find:
i the mean of X
ii the mode of X.
b Comment on the skewness of the distribution.

Note: You can also refer to a sketch of the p.d.f. to describe skewness. Comparing one pair of
measures of central tendency, or using a sketch, will be sufficient to justify skewness in your exam.

a i $E(X) = \displaystyle\int_0^2 \tfrac{1}{10}x^2\,dx + \int_2^{10} \left(\tfrac{1}{4}x - \tfrac{1}{40}x^2\right)dx = \left[\tfrac{1}{30}x^3\right]_0^2 + \left[\tfrac{1}{8}x^2 - \tfrac{1}{120}x^3\right]_2^{10}$

$= \tfrac{8}{30} + \left(\tfrac{25}{6} - \tfrac{13}{30}\right) = 4$

Mean of X = 4

ii [Figure: sketch of f(x), rising linearly from 0 at x = 0 to 0.2 at x = 2, then falling linearly to 0 at x = 10]

From the sketch, the mode of X is 2.

b Mode < mean so the distribution is positively skewed.


Exercise 3D


1 The continuous random variable X has probability density function given by 


3 
foe faa 0<x<4 
0 otherwise 
a Sketch the probability density function of X. 


b Find the mode of X. 


The continuous random variable X has probability density function given by 
1 
7 0O<x<4 

f(x) = | gx x 


0 otherwise 


a Find the cumulative distribution function of X. 

b Find the following, giving your answers to 3 significant figures: 
i the median of X 
ii the 10th percentile of X 
iii the 80th percentile of X. 


The continuous random variable X has cumulative distribution function given by

$$F(x) = \begin{cases} 0 & x < 0 \\ \dfrac{x}{3} & 0 \le x < 2 \\ -\dfrac{x^2}{3} + 2x - 2 & 2 \le x \le 3 \\ 1 & x > 3 \end{cases}$$

Find the following, giving your answers to 3 decimal places:
a the median value of X
b the quartiles and the interquartile range of X.
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4 


7 
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The continuous random variable XY has probability density function given by 


mye {1 0<x<2 
0 otherwise 
a Sketch the probability density function of X. 
b Write down the mode of X. 
Find: 
c the cumulative distribution function of X 
d the median value of X 
e the upper quartile 
f the Sth percentile, giving your answer correct to 3 significant figures. 


The continuous random variable Y has probability density function given by 


1 1 
SS ay O<xys3 
mv={) : 


0 otherwise 


a Sketch the probability density function of Y. 

b Use your sketch to describe the skewness of Y. 

c Write down the mode of Y. 

Find: 

d the cumulative distribution function of Y 

e the median value of Y correct to 3 significant figures 

f the 10th to 90th percentile range, correct to 3 significant figures. 


The continuous random variable X¥ has probability density function given by 


1 
ae 0<x<2 


f(x) = | 
0 otherwise 

a Sketch the probability density function of X. 

b Write down the mode of YX. 

Find: 

c the cumulative distribution function of X 

d the median value of X. 


The continuous random variable XY has probability density function given by 


3 
raya | HOP*D -l<x<l 


0 otherwise 


© 10 


a Sketch the probability density function of X. 
b What can you say about the mode of X? 


c Write down the median value of YX. 
d Find the cumulative distribution function of X. 


The continuous random variable XY has probability density function given by 


3 
rey ={ #O—™ 0<x<2 


0 otherwise 


a Sketch the probability density function of X. 

b Use your sketch to describe the skewness of X. 

Find: 

c the mode of X 

d the cumulative distribution function of YX. 

e Show that the median value of X lies between 1.23 and 1.24. 


9 The continuous random variable X has cumulative distribution function given by

$$F(x) = \begin{cases} 0 & x < 1 \\ \tfrac{1}{8}(x^2 - 1) & 1 \le x \le 3 \\ 1 & x > 3 \end{cases}$$

Find:
a the probability density function of the random variable X (2 marks)
b the mode of X (2 marks)
c the median of X. (2 marks)
d Describe the skewness of X, giving a reason for your answer. (2 marks)
e Find the value k such that $P(k < X < k + 1) = 0.6$ (3 marks)

10 The continuous random variable X has cumulative distribution function given by

$$F(x) = \begin{cases} 0 & x < 0 \\ 4x^3 - 3x^4 & 0 \le x \le 1 \\ 1 & x > 1 \end{cases}$$

Find:
a the probability density function of the random variable X (3 marks)
b the mode of X (2 marks)
c $P(0.2 \le X \le 0.5)$ (3 marks)
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EP) 12 


@) 13 


(E/P) 14 


15 
The amount of vegetables eaten by a family in a week is a continuous random variable W kg.
The continuous random variable W has probability density function given by 


Sv-w) 0<ws5 


f(w) = 
0 otherwise 
a Find the cumulative distribution function of the random variable W. (3 marks) 
b Show that the median of W lies between 3.4 kg and 3.5kg. (3 marks) 
c Find the mode of W, fully justifying your answer. (4 marks) 


d Hence describe the skewness of the distribution, giving a reason for your answer. _— (1 mark) 


The continuous random variable XY has a probability density function given by 
0<x<1l 
f(x) = l=x=2 


a 
0 otherwise 


Find: 

a E(X) (5 marks) 
b the cumulative distribution function (4 marks) 
c to 3 decimal places, the median and the interquartile range of the distribution. (5 marks) 
d Describe the skewness of this distribution, giving a reason for your answer. (1 mark) 


For each of the following sets of conditions, sketch the probability density function of 
a distribution which satisfies all the conditions: 


a the distribution is symmetrical but mode 4 median 
b there is a unique mode which lies outside the interquartile range. 


By fully specifying a suitable p.d.f. give an example of a non-symmetrical distribution in which 
the median and the mode are equal. (3 marks) 


The continuous random variable X has probability density function given by

$$f(x) = \begin{cases} \dfrac{1}{x \ln 5} & 1 \le x \le 5 \\ 0 & \text{otherwise} \end{cases}$$

a Find the mode of X, fully justifying your answer. (2 marks)
b Specify the cumulative distribution function of X. (3 marks)
Find:
c the exact value of the median of X (2 marks)
d the quartiles and the interquartile range of X. (3 marks)


(E/P) 16 The life, X, of the Nitelite light bulb is modelled by the probability density function

$$f(x) = \begin{cases} 2.5e^{-2.5x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

where X is measured in thousands of hours.
Find:
a the median of X (5 marks)
b the quartiles and the interquartile range of X. (3 marks)

(E/P) 17 The continuous random variable X has probability density function given by

$$f(x) = \begin{cases} k\sec^2(\pi x) & 0 \le x \le 0.25 \\ 0 & \text{otherwise} \end{cases}$$

Find:
a the value of k (3 marks)
b the cumulative distribution function of X (2 marks)
c the median of X. (2 marks)

(E/P) 18 The continuous random variable X has probability density function given by

$$f(x) = \begin{cases} \dfrac{k}{x(6 - x)} & 2 \le x \le 4 \\ 0 & \text{otherwise} \end{cases}$$

Find:
a the exact value of k (4 marks)
b E(X) (3 marks)
c Var(X) (4 marks)
d the c.d.f. of X (4 marks)
e the median of X. (2 marks)
f Write down the mode of X. (1 mark)
g Comment on the skewness of this distribution. (2 marks)


3.5 The continuous uniform distribution


■ A random variable having a continuous uniform
distribution over the interval [a, b] has p.d.f.

$$f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

A sketch of the p.d.f. is shown.

[Figure: rectangle of height 1/(b − a) between x = a and x = b; Area = 1]

Notation: If X has the continuous uniform distribution
over the interval [a, b] you write $X \sim U[a, b]$.
The continuous random variable X is uniformly distributed on [3, 5]. Find:
a $P(3.2 < X < 4.3)$ b k such that $P(2X < k - X) = 0.2$.

a [Figure: sketch of the p.d.f., a rectangle of height 1/2 between x = 3 and x = 5, with the region from 3.2 to 4.3 shaded]

$P(3.2 < X < 4.3) = (4.3 - 3.2) \times 0.5 = 0.55$

Note: When dealing with a uniform distribution it
is easier to sketch the p.d.f. and work out the area
of the rectangle. You can also use proportion: you
are interested in a range of values of width 1.1
out of an interval of width 2, so the probability is $\tfrac{1.1}{2} = 0.55$.

b $2X < k - X$
$3X < k$
$X < \tfrac{1}{3}k$
So $P(X < \tfrac{1}{3}k) = 0.2$
$0.5(\tfrac{1}{3}k - 3) = 0.2$
$k = 10.2$


The continuous random variable X has p.d.f. as shown in the diagram.

[Figure: the p.d.f. is a rectangle of height 0.4 between x = k and x = 4]

Find:
a the value of k b $P(3 < X < 3.5)$
c $P(X > 3 \mid X > 2)$

a Area = 1
$0.4 \times (4 - k) = 1$
$4 - k = 2.5$
$k = 1.5$

b $P(3 < X < 3.5) = 0.4 \times (3.5 - 3) = 0.2$

c $P(X > 3 \mid X > 2) = \dfrac{P(X > 3 \text{ and } X > 2)}{P(X > 2)} = \dfrac{P(X > 3)}{P(X > 2)} = \dfrac{0.4}{0.8} = \dfrac{1}{2}$

Problem-solving: To solve conditional probability problems with
a continuous uniform distribution, you can use a
continuous uniform distribution on a restricted
sample space. Given that X > 2, the value of X is
uniformly distributed on [2, 4].
Example

The continuous random variable X has probability density function

$$f(x) = \begin{cases} \tfrac{1}{5} & 3 \le x \le 8 \\ 0 & \text{otherwise} \end{cases}$$

a Write down the name of this distribution.
The continuous random variable Y = 12 − 3X.
Find:
b E(Y) c $P(Y > 0)$ d $P(X < 7 \mid Y < 0)$

a Continuous uniform distribution

b $E(Y) = 12 - 3E(X) = 12 - 3 \times 5.5 = -4.5$

c $P(Y > 0) = P(12 - 3X > 0) = P(12 > 3X) = P(4 > X) = \tfrac{1}{5}$

d $P(X < 7 \mid Y < 0) = \dfrac{P(4 < X < 7)}{P(X > 4)} = \dfrac{0.6}{0.8} = \dfrac{3}{4}$

Problem-solving: You could also tackle this problem by finding the
distribution of Y. A linear transformation of a
uniform distribution will be uniform, so $Y \sim U[-12, 3]$.

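The continuous random variable X has probability density function

$$f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

Find:
a E(X) b Var(X) c F(x)

a [Figure: sketch of f(x), a rectangle of height 1/(b − a) between x = a and x = b]

By symmetry $E(X) = \dfrac{a + b}{2}$

b $\mathrm{Var}(X) = \displaystyle\int_a^b \left(x - \frac{a + b}{2}\right)^2 \frac{1}{b - a}\,dx = \left[\frac{\left(x - \frac{a + b}{2}\right)^3}{3(b - a)}\right]_a^b$

$= \dfrac{\dfrac{(b - a)^3}{8} + \dfrac{(b - a)^3}{8}}{3(b - a)} = \dfrac{\dfrac{(b - a)^3}{4}}{3(b - a)} = \dfrac{(b - a)^3}{12(b - a)} = \dfrac{(b - a)^2}{12}$

c If $a \le x \le b$, $F(x) = \displaystyle\int_a^x \frac{1}{b - a}\,dt = \left[\frac{t}{b - a}\right]_a^x = \frac{x - a}{b - a}$

$$F(x) = \begin{cases} 0 & x < a \\ \dfrac{x - a}{b - a} & a \le x \le b \\ 1 & x > b \end{cases}$$

Watch out: The cumulative distribution function
for a uniform continuous distribution is not
given in the formulae booklet. It can be useful
to remember it, but make sure you know how
to derive it from first principles as shown here.

■ For a continuous uniform distribution U[a, b]

• $E(X) = \dfrac{a + b}{2}$
• $\mathrm{Var}(X) = \dfrac{(b - a)^2}{12}$
• $F(x) = \begin{cases} 0 & x < a \\ \dfrac{x - a}{b - a} & a \le x \le b \\ 1 & x > b \end{cases}$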

1 


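The continuous random variable X is uniformly distributed over the interval [4, 7].
Find:
a E(X) b Var(X)
c the cumulative distribution function of X, for all x.

a $E(X) = \dfrac{7 + 4}{2} = 5.5$

b $\mathrm{Var}(X) = \dfrac{(7 - 4)^2}{12} = \dfrac{3}{4}$

c If $4 \le x \le 7$, $F(x) = \displaystyle\int_4^x \tfrac{1}{3}\,dt = \left[\tfrac{1}{3}t\right]_4^x = \dfrac{x - 4}{3}$

$$F(x) = \begin{cases} 0 & x < 4 \\ \dfrac{x - 4}{3} & 4 \le x \le 7 \\ 1 & x > 7 \end{cases}$$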


The continuous random variable Y is uniformly distributed over the interval [a, b].
Given that $E(Y) = 1$ and $\mathrm{Var}(Y) = \tfrac{16}{3}$, find the value of a and the value of b.

Problem-solving: Use the formulae for the mean and variance
of a continuous uniform distribution to form
simultaneous equations in a and b.

$E(Y) = \dfrac{a + b}{2} = 1$
$a + b = 2$ (1)
$\mathrm{Var}(Y) = \dfrac{(b - a)^2}{12} = \dfrac{16}{3}$
$(b - a)^2 = 64$ (2)

Solving equations (1) and (2) simultaneously

$b = 2 - a$
$(2 - a - a)^2 = 64$
$(2 - 2a) = \pm 8$

$2 - 2a = 8$ gives $a = -3$ and $b = 2 - (-3) = 5$
$2 - 2a = -8$ gives $a = 5$ and $b = 2 - 5 = -3$

Since a < b, a = −3 and b = 5.


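The continuous variable X is uniformly distributed over the interval [−3, 5].

a Write down E(X).
b Use integration to find the variance of X.

a $E(X) = \dfrac{5 + (-3)}{2} = 1$

b $\mathrm{Var}(X) = E(X^2) - (E(X))^2 = \displaystyle\int_{-3}^{5} \frac{x^2}{8}\,dx - 1^2 = \left[\frac{x^3}{24}\right]_{-3}^{5} - 1 = \frac{125 + 27}{24} - 1 = \frac{16}{3}$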

Exercise 3E


1 The continuous random variable XY ~ U[2, 7]. 


Find: 
a $P(3 < X < 5)$ b $P(X > 4)$


2 The continuous random variable X has a probability density function 


as shown in the diagram. 
Find: 


a the value of k b $P(4 < X < 7.9)$

3 The continuous random variable X has probability density function

$$f(x) = \begin{cases} k & -2 \le x \le 6 \\ 0 & \text{otherwise} \end{cases}$$

Find:
a the value of k b $P(-1.3 < X < 4.2)$ c p such that $P(3X < X + p) = 0.5$
d $P(X > 5 \mid X > 0)$ e $P(X > 0 \mid X < 3)$ f $P(X < 1 \mid 0 < X < 2)$


4 The continuous random variable Y ~ Ua, 5]. Given that P(Y < 5) = - and P(Y > 7) = 4, 


find the value of a and the value of D. 


5 The continuous random variable X ~ U[2, 8].
a Write down the distribution of Y = 2X + 5.
b Find $P(12 < Y < 20)$.

Hint: If X has the continuous uniform distribution
then aX + b, where a and b are constants, will also
have the continuous uniform distribution.


6 The continuous random variable X has probability density function

$$f(x) = \begin{cases} \tfrac{1}{10} & 2 \le x \le 12 \\ 0 & \text{otherwise} \end{cases}$$

a Write down the name of this distribution. (1 mark)
The continuous random variable Y = 20 − 2X.
Find:
b E(Y) (2 marks)
c $P(Y < 4)$ (2 marks)
d $P(Y > 4 \mid X < 10)$ (3 marks)
7 The continuous variable X is uniformly distributed over the interval [−3, 5].
Find:
a E(X) (1 mark)
b Var(X) (1 mark)
c E(X²) (2 marks)
d the cumulative distribution function of X, for all x. (3 marks)
8 Find E(X) and Var(X) for the following probability density functions. 
e={! l<=x=5 * wo={3 —2=x <6 
0 otherwise 0 otherwise 
9 The continuous random variable X has probability density function f(x) 
as shown in the diagram. 
Find: 7 — 
a E(X) b Var(X) es 
c E(x’) d the cumulative distribution function of X, for all x. on 55 * 


EP) 10 


@p) 11 


12 


EP) 13 


EP) 14 


© 15 
The continuous random variable Y ~ U[a, b]. Given that E(Y) = 1 and Var(Y) = 4 find the


value of a and the value of b. 


The continuous random variable X has probability density function 


1 
a -l<x<5 
m=" 7 


0 otherwise 


Given that Y = 4X —- 6, find E(Y) and Var(Y). 


The continuous random variable X is uniformly distributed over the interval [a, 3]. 
Find: 
a P(X <4a+ 4) b P(X > 2a +3) 


(3 marks) 


(3 marks) 


The continuous random variable R is uniformly distributed on the interval a < RS #3. 


Given that E(R) = 5 and Var(R) = 4 find: 
a the value of a and the value of 3 
b P(R < 5.2) 


(3 marks) 
(1 mark) 


The continuous random variable X is uniformly distributed over the interval a < R < (3. 


a Write down the probability density function of X, for all x. 
b Given that E(Y) = 2.5 and P(Y < 1) = + find the value of a and the value of (3. 


The continuous random variable X is uniformly distributed over the interval [-—S, 4]. 


a Write down fully the probability density function f(x) of X. 
b Sketch the probability density function f(x) of X. 

Find: 

ec E(x’) 

d P(-0.2 < X¥ < 0.6) 


A continuous random variable X has cumulative distribution function 


0 x<-3 
F(x) = “2 -3<x<4 
1 x>4 


a Find P(X < 0). 

b Find the probability density function f(x) of X. 
c Write down the name of the distribution of YX. 
d Find the mean and the variance of Y. 


(1 mark) 
(3 marks) 


(2 marks) 
(2 marks) 


(2 marks) 
(2 marks) 


(1 mark) 
(2 marks) 
(1 mark) 
(3 marks) 


Continuous distributions 


17 The continuous random variable X is uniformly distributed over the interval [-1, 4]. 


Find: 

a E(X) (2 marks) 
b Var (XY) (2 marks) 
c E(x?) (2 marks) 
d P(X < 1.4) (1 mark) 


A total of 6 observations of X are made. 
e Find the probability that exactly 4 of these observations are less than 1.4. (2 marks) 


18 The continuous random variable X is uniformly distributed over the interval [a, (]. 
Given that E(Y) = 7.5 and PLY > 10.5) = 0.25 


a find the value of a and the value of (3. (3 marks) 
b Given that PLY < c) = 3 find: 


i the value of c 
ii Pic < X¥ <9) (3 marks) 


€ Modelling with the continuous uniform distribution 


The continuous uniform distribution is frequently used to model real-life situations. For example,
if you know that trains leave from a station hourly, but you arrive not knowing when the next train
will leave, then the length of time you have to wait after arriving at the station, X minutes, could be
modelled as X ~ U[0, 60].
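The train-waiting model can be explored numerically. The following is a minimal sketch, not part of the course text: the helper names uniform_mean, uniform_variance and uniform_cdf are invented for this illustration, and they simply encode the standard results for U[a, b].

```python
from fractions import Fraction


def uniform_mean(a, b):
    """E(X) for X ~ U[a, b]."""
    return (Fraction(a) + Fraction(b)) / 2


def uniform_variance(a, b):
    """Var(X) for X ~ U[a, b]."""
    return (Fraction(b) - Fraction(a)) ** 2 / 12


def uniform_cdf(x, a, b):
    """F(x) = P(X <= x) for X ~ U[a, b]."""
    if x <= a:
        return Fraction(0)
    if x >= b:
        return Fraction(1)
    return (Fraction(x) - Fraction(a)) / (Fraction(b) - Fraction(a))


# Waiting time X ~ U[0, 60], measured in minutes.
mean_wait = uniform_mean(0, 60)        # expected wait
var_wait = uniform_variance(0, 60)     # variance of the wait
p_under_15 = uniform_cdf(15, 0, 60)    # P(X < 15)
p_10_to_25 = uniform_cdf(25, 0, 60) - uniform_cdf(10, 0, 60)  # P(10 < X < 25)

print(mean_wait, var_wait, p_under_15, p_10_to_25)
```

Exact fractions are used so that the answers match what you would write by hand. Any interval of the same width has the same probability under this model.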


The trunk of a small tree varies in diameter from 10 cm at the bottom to 2 cm at the top. The tree is
cut horizontally at a randomly chosen point, and the radius R cm of the cross-section is modelled
as R ~ U[1, 5].


Find the expected value of the area, A, of the cross-section of the tree. 


A = πR²

E(A) = E(πR²) = πE(R²)

Watch out: E(R²) is not the same as (E(R))².

Var(R) = E(R²) − (E(R))²
Rearranging gives
E(R²) = Var(R) + (E(R))²

Var(R) = (5 − 1)²/12 = 4/3
E(R) = (5 + 1)/2 = 3

E(R²) = 4/3 + 9 = 31/3

E(A) = 31π/3
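As a check on this working, the sketch below recomputes E(R²) for R ~ U[1, 5] in two ways: from Var(R) + (E(R))², and from a numerical estimate of the integral of r² f(r). It is an illustration only. The midpoint rule and the variable names are choices made here, not part of the example.

```python
from fractions import Fraction
import math

a, b = 1, 5  # R ~ U[1, 5]

mean_r = Fraction(a + b, 2)            # E(R)
var_r = Fraction((b - a) ** 2, 12)     # Var(R)
e_r2 = var_r + mean_r ** 2             # E(R^2) = Var(R) + (E(R))^2

# Midpoint-rule estimate of E(R^2) = integral of r^2 * 1/(b - a) over [a, b].
n = 100_000
h = (b - a) / n
e_r2_numeric = sum((a + (i + 0.5) * h) ** 2 for i in range(n)) * h / (b - a)

expected_area = math.pi * float(e_r2)  # E(A) = pi * E(R^2)
print(e_r2, e_r2_numeric, expected_area)
```

Note that π(E(R))² = 9π is smaller than E(A), which is why E(R²) must not be replaced by (E(R))².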
Example


The length of a pencil is measured to the nearest cm. Write down the distribution of the rounding 
errors R. 


The error is the difference between the true length and the recorded length.
If a pencil is recorded as 20 cm long then its length is anywhere in the interval

19.5 cm ≤ length < 20.5 cm

The error is therefore in the interval

−0.5 ≤ error < 0.5

As it is reasonable to assume that the error is equally likely to take any of the values in this range
R ~ U[−0.5, 0.5]

Note: The uniform distribution is often used as a model for errors made by rounding up or down
when recording measurements.
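A short simulation can make this model more concrete. The sketch below is an illustration under stated assumptions: the true lengths are invented (drawn between 10 cm and 25 cm), the helper record_to_nearest_cm is a name made up here, and halves are rounded up. If the U[−0.5, 0.5] model is reasonable, the simulated errors should have mean close to 0 and variance close to 1/12.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so that the run is repeatable


def record_to_nearest_cm(true_length):
    """Record a length to the nearest whole cm (halves round up)."""
    return math.floor(true_length + 0.5)


# Hypothetical true pencil lengths, in cm.
true_lengths = [random.uniform(10, 25) for _ in range(50_000)]

# Error = true length - recorded length.
errors = [x - record_to_nearest_cm(x) for x in true_lengths]

error_mean = statistics.mean(errors)
error_variance = statistics.pvariance(errors)

print(min(errors), max(errors))    # both lie between -0.5 and 0.5
print(error_mean, error_variance)  # compare with 0 and 1/12
```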


Example 




Write down the name of the distribution you would recommend as a suitable model for each of the 
following situations. 


a The masses of 200g tins of tomatoes produced on a production line. 


b The difference between the true length and the length of metal rods measured to the nearest 
centimetre. 


b Continuous uniform distribution


Exercise 


! 


1 The random variable X is the length of a side of a square and is modelled as X ~ U[4.5, 5.5].
The random variable Y is the area of the square. 
Find E(Y). 


2 The random variable R has a continuous uniform distribution over the interval [5, 11]. 
a Specify fully the probability density function of R. 
b Find P(7 < R < 10). 
The random variable A is the area of a circle of radius R cm.
c Find E(A).


In a computer game, an alien appears every 2 seconds. The player stops the alien by pressing
a key. The object of the game is to stop the alien as soon as it appears. Given that the player
actually presses the key T seconds after the alien first appears, a simple model of the game
assumes that T is a continuous uniform random variable defined over the interval [0, 1].


a Write down P(T < 0.2). (1 mark) 
b Write down E(T). (2 marks) 
c Use integration to find Var (T). (3 marks) 


The time in minutes that Priya takes to check out at her local supermarket follows a continuous 
uniform distribution defined over the interval [2, 10]. 


Find: 
a the probability that Priya will take more than 7 minutes to check out on one visit to the 
supermarket (2 marks) 


b the probability that Priya will take less than 5 minutes to check out on each of three successive 
visits to the supermarket (3 marks) 


Given that Priya has already spent 5 minutes at the checkout, 


c find the probability that she will take a total of less than 8 minutes to check out. (2 marks) 


A drinks machine dispenses coffee into cups. It is electronically controlled to cut off the flow 
of coffee randomly between 175 ml and 215ml. The random variable X is the volume of 
coffee dispensed into a cup, and is uniformly distributed. 
a Specify the probability density function of X and sketch its graph. (3 marks) 
b Find the probability that the machine dispenses: 
i less than 187ml 


ii exactly 187 ml. (3 marks) 
c Calculate the interquartile range of X. (3 marks) 
d Determine the value of x such that PLY = x) = 0.65 (2 marks) 


Brenda buys five cups of coffee from the drinks machine for people in her office.

Hint: Use a binomial distribution.


e Find the probability that exactly three of the cups contain less than 187 ml. (3 marks) 


The continuous random variable X represents the error, in mm, made when a machine cuts iron
rods to a target length. X has a continuous uniform distribution over the interval [-3.0, 3.0]. 


Find: 

a P(X < -2.3) (2 marks) 
b P(|X| > 2.0) (2 marks) 
Ten rods are cut. 


c Calculate the probability that exactly six are cut within 2 mm of the target length. (3 marks)
A manufacturer produces sweets of length Y mm where Y has a continuous uniform distribution
with range [20, 28]. 
a Find the probability that a randomly selected sweet has length greater than 26mm. (2 marks) 


These sweets are randomly packed in bags of 20 sweets. 


b Find the probability that a randomly selected bag will contain at least 7 sweets with length 
greater than 26mm. (3 marks) 


The waiting times, in minutes, between flight take-offs at an airport are modelled by the
continuous random variable X with probability density function


f(x) = 1/5    2 ≤ x ≤ 7
       0      otherwise


A randomly selected flight takes off at 10am. 


a Find the probability that the next flight takes off before 10:05 am (2 marks) 
b Find the probability that at least 3 of the next 10 flights have a waiting time of more 
than 6 minutes. (3 marks) 


A wooden dowelling rod of length 20cm is cut into two pieces at a randomly chosen point. The 
length of the longer piece, X cm, is modelled as having a continuous uniform distribution over 
the interval [10, 20]. 


The two pieces of the dowelling rod are used to form the base and height of a rectangle, as
shown below.


[Figure: rectangle with base of length X cm]


Find the expected area of the rectangle. (6 marks) 


Mixed exercise 3


1 
The random variable X has probability density function f(x) given by
ie a a 
f(x) = (1 + a Q0=x=2 
0 otherwise 
Find: 
a E(X) and E(3X + 2) b Var(X) and Var(3X + 2) 
é Pee) d P(X > E(X)) 


e P(0.5 < X < 1.5)


2 




The random variable X has probability density function f(x) given by 


f(x) = 2 − 2x    0 ≤ x ≤ 1
       0         otherwise
a Evaluate E(X). 
b Evaluate Var (X). 
c Write down the values of E(2X + 1) and Var(2X + 1). 
d Specify fully the cumulative distribution function of X.


e Work out the median value of X. 


The continuous random variable Y has cumulative distribution function given by 


0 y<l 
F(y)=)k(y?-y) 1<y<2 
1 y>2 
where k is a positive constant. 
a Show that k = +4 b Find P(Y < 1.5). 
c Find the value of the median. d Specify fully the probability density function f(y). 


The continuous random variable X has cumulative distribution function


         0                x < 2
F(x) =   (1/5)(x² − 4)    2 ≤ x ≤ 3
         1                x > 3
Find:
a P(X > 2.4)


b the median

c the probability density function, f(x).
d Evaluate E(X).

e Find the mode of X. 


The random variable X has probability density function f(x) given by 


F ea Vee? 
(x) = 0 otherwise 


where k is a positive constant. 


a Show that & = 3 (1 mark) 
b Calculate E(Y). (3 marks) 
c Specify fully the cumulative distribution function of X. (4 marks) 
d Find the value of the median. (2 marks) 
e Find the value of the mode. (1 mark) 
The random variable Y has probability density function f(y) given by


my (te ee daa 
0 otherwise 
where k is a positive constant. 
a Show that k = = 
b Specify fully the cumulative distribution function of Y. 
c Evaluate P(Y < 2). 


A random variable X has probability density function f(x) given by 


may ={ BA 2<x<2 
0 otherwise 
a Sketch the probability density function of X. 
b Write down the mode of X.
c Specify fully the cumulative distribution function of X. 


d Find P(0.5 < X < 1.5).


A random variable X has probability density function f(x) given by 
3 0<x<l 
fix)=4 2x2 1<x<2 
0 otherwise 
a Find E(X). 


b Specify fully the cumulative distribution function of X. 


c Find: 
i the median of X 
ii the 15th percentile of X. 


d Describe the skewness of the distribution, giving a reason for your answer. 


The continuous random variable X has cumulative distribution function 
0 x<l 
F(X) =) 0.05a-b 1<x<2 
1 x >2 
where a and b are positive constants. 


Find a and b, showing your working clearly. 


(2 marks) 
(4 marks) 
(3 marks) 


(2 marks) 

(1 mark) 
(4 marks) 
(3 marks) 


(3 marks) 
(4 marks) 


(3 marks) 
(1 mark) 


(7 marks) 




10 A student writes the following cumulative distribution function for a continuous random 
variable X. 


pr) 11 


O x<5 
F(x) = ¢2(16x -x2-55) 5<x<10 


| x > 10 


Explain why this cannot be a cumulative distribution function. 


(2 marks) 


A continuous random variable X has probability density function f(x) given by 


; a 1<xs<3 
= 0 otherwise 


where k is a positive constant. 
a Show that k = = 
b Find E(X). 


c Work out the cumulative distribution function, F(x). 
d Show that the median value lies between 2.4 and 2.5. 
e Hence comment on the skewness of the distribution. 


(2 marks) 
(3 marks) 
(4 marks) 
(3 marks) 

(1 mark) 


f Find the 10th to 90th percentile range, giving your answer correct to 3 significant 


figures. 


(2 marks) 


The continuous random variable X has probability density function given by 


x O<xx<!l 
3x2 

f(x)=\ qq l<x=2 
0 otherwise 


a Sketch the probability density function of X. 
b Find the mode of X. 

ec Find E(2X). 

d Find Var(2X + 1). 


e Specify fully the cumulative distribution function of X. 


f Using your answer to part e, find the median of X. 


(2 marks) 

(1 mark) 
(3 marks) 
(3 marks) 
(4 marks) 
(2 marks) 
The continuous random variable X has probability density function f(x) given by


ae 0O<sx<2 


0 otherwise 
a Sketch the graph of f(x) for all values of x. (2 marks) 
b Write down the mode of YX. (1 mark) 
c Show that PLY > 2) = 0.75 (2 marks) 
d Define fully the cumulative distribution function F(x). (4 marks) 
e Find the median of YX. (3 marks) 


The continuous random variable X has cumulative distribution function F(x) given by


         0                           x < 2
F(x) =   (1/81)(−2x³ + 15x² − 44)    2 ≤ x ≤ 5
         1                           x > 5
a Find the probability density function f(x). (3 marks) 
b Find the mode of X. (2 marks) 
c Sketch f(x) for all values of x. (3 marks) 
d Find the mean μ of X. (3 marks)
e Show that F(μ) > 0.5. (1 mark)
f Show that the median of X lies between the mode and the mean. 
Hence describe the skewness of the distribution. (3 marks) 


A continuous random variable X has cumulative distribution function F(x) given by


         0               x < 0
F(x) =   k(35x − 2x²)    0 ≤ x ≤ 5
         1               x > 5
a Show that k = 1/125 (1 mark)
b Find the median of X. (3 marks) 
c Find the probability density function f(x). (3 marks) 
d Sketch f(x) for all values of x. (3 marks) 
e Write down the mode of X. (1 mark)
f Find E(X). (3 marks) 
g Comment on the skewness of this distribution. (2 marks) 


The continuous random variable X has probability density function f(x) given by 


‘ t +b O0<x<2 
(x) = 0 otherwise 


If E(XY) = 2, find the values of a and b. 
A continuous random variable X has probability density function f(x) where


f(x) = k(x + 1)³    −1 ≤ x ≤ 0
       0            otherwise

where k is a positive integer.

a Show that k = 4 (3 marks)
Find:

b E(X) (4 marks)
c the cumulative distribution function F(x) (4 marks) 
d the median. (3 marks) 


The continuous random variable X is uniformly distributed over the interval [−2, 5].
a Sketch the probability density function f(x) of X.


Find:

b E(X) c Var(X)

d the cumulative distribution function of X, for all x

e P(3.5 < X < 5.5) f P(X = 4)

g P(X > 0 | X < 2) h P(X > 3 | X > 0)

The continuous random variable X has p.d.f. as shown in the diagram.

[Figure: graph of f(x) against x; the value 0.2 is marked on the vertical axis]

Find:

a the value of k b P(−2 < X < −1)

c E(X) d Var(X)


e the cumulative distribution function of X, for all x.


The continuous random variable Y is uniformly distributed on the interval a ≤ Y ≤ b.
Given E(Y) = 2 and Var(Y) = 3, find: 
a the value of a and the value of b b P(Y > 1.8) 


The continuous random variable X has a continuous uniform distribution on the interval [0, 2] 
The continuous random variable Y= 10-5X. 


a Describe the distribution of Y. (2 marks) 
b Find P(Y < 3). (2 marks) 
c Find P(Y > 3 | X > 0.5). (3 marks)


A child has a pair of scissors and a piece of string 20 cm long which has a mark on one end.
The child cuts the string, at a randomly chosen point, into two pieces. Let X represent the
length of the piece of string with the mark on it.

a Write down the name of the probability distribution of X and sketch the graph of its
probability density function. (3 marks)
b Find the values of E(X) and Var(X). (4 marks)
c Using your model, calculate the probability that the shorter piece of string is at least 

8 cm long. (3 marks) 
Joan records the temperature every day. The highest temperature she recorded was 29 °C to the
nearest degree. Let X represent the error in the measured temperature. 


a Suggest a suitable model for the distribution of X. (1 mark) 
b Using your model, calculate the probability that the error will be less than 0.2°C. (3 marks) 
c Find the variance of the error in the measured temperature. (2 marks) 


Jameil catches a bus to work every morning. According to the timetable the bus is due at 
9am, but Jameil knows that the bus can arrive at a random time between three minutes early 
and ten minutes late. The random variable X represents the time, in minutes, after 9am when 
the bus arrives. 


a Suggest a suitable model for the distribution of X and specify it fully. (2 marks) 
b Calculate the mean value of X. (2 marks)
c Find the cumulative distribution function of X. (4 marks) 


Jameil will be late for work if the bus arrives after 9:05 am. 
d Find the probability that Jameil is late for work. (2 marks) 


A plumber measures, to the nearest cm, the lengths of pipes. 


a Suggest a suitable model to represent the difference between the true lengths and the 
measured lengths. (1 mark) 


b Find the probability that for a randomly chosen pipe the measured length will be within
0.2 cm of the true length. (2 marks)


c Three pipes are selected at random. Find the probability that the measured lengths of all 
three pipes will be within 0.2 cm of the true length. (2 marks) 


A coffee machine dispenses coffee into cups. It is electronically controlled to cut off the flow 
of coffee randomly between 190 ml and 210 ml. The random variable X is the volume of coffee 
dispensed into a cup. 


a Specify the probability density function of X and sketch its graph. (3 marks)


b Find the probability that the machine dispenses 
i less than 198 ml 
ii exactly 198 ml. (3 marks) 


c Calculate the interquartile range of X. (2 marks)


d Given that the machine has already dispensed 195 ml coffee into a cup, find the 
probability that it will dispense more than 200 ml into that cup. (2 marks) 


Write down the name of the distribution you would recommend as a suitable model for each of 
the following situations: 


a the difference between the true height and the height measured, to the nearest cm, 
of randomly chosen people 


b the heights of randomly selected 18-year-old females. 


The delay in departure, T hours, of a flight from Statistics airport is modelled by the 
probability density function 


m= {2e-” 0<1<6 
0 otherwise 


a Find the cumulative distribution function F(f). 
b Find the median value of T. 


c Find E(T). 

A continuous random variable X is uniform on the interval [b, 55]. 

a Write down the probability density function of YX. (2 marks) 
b Write down the value of E(X). (1 mark) 
c Show by integration that Var(Y) = 1 (3 marks) 
Given that b = 3 

d find PLY >10) (2 marks) 
Five observations are taken from this distribution. 

e Find the probability that exactly three of them are bigger than 10. (4 marks) 


The continuous random variable XY has probability density function given by 


2 
—*_ -j<xx< 
f(x) = Oe-tins 
0 otherwise 
Find F(x). (3 marks) 


As part of its employee selection process, ROBU Bank sets an aptitude test. Over the years it
has found that the percentage scored X (measured in 100s) by prospective employees can be
modelled by the probability density function f(x), where


f(x) = k sin(πx)    0 ≤ x ≤ 1
       0            otherwise
Find:
a the value of k (5 marks)
b E(X) (5 marks)


A random variable X has probability density function given by 
k 0<x<1 


k 
f(x) = 2 1<x<2 


0 otherwise 
a Find the value of k. (3 marks) 
b Calculate the value of E(X). (4 marks)
c Calculate the value of Var(X). (4 marks) 
Challenge


1 A spinner is made using a circle of radius r, and a pointer of length r which is
pivoted at the centre of the circle. The pointer is spun and allowed to come
to rest. The random variable θ represents the angle between the vertical and
the resting position of the spinner, and the random variable X represents the
horizontal distance of the end of the spinner from the centre of the circle.


a Describe a suitable distribution to model θ.
b Hence, or otherwise, find E(X).


c Briefly explain how this spinner could be used as part of an experiment to
estimate the value of π.


2 A continuous random variable X having a probability density function f(x), where


f(x) = λe^(−λx)    x ≥ 0
       0           otherwise


where λ is a positive constant, is said to follow an exponential distribution.
Show that:


a E(X) = 1/λ and Var(X) = 1/λ²


b P(X > a + b | X > a) = P(X > b)


Summary of key points 




1 If X is a continuous random variable with probability density function f(x), then

• f(x) ≥ 0 for all x ∈ ℝ
• P(a < X < b) = ∫_a^b f(x) dx
• ∫_{−∞}^{∞} f(x) dx = 1


2 For a random variable X, the cumulative distribution function F(x) = P(X ≤ x).


3 If X is a continuous random variable with c.d.f. F(x) and p.d.f. f(x):

f(x) = d/dx F(x) and F(x) = ∫_{−∞}^{x} f(t) dt




4 If X is a continuous random variable with probability density function f(x):

• the mean or expected value of X is given by
  E(X) = μ = ∫_{−∞}^{∞} x f(x) dx
• the variance of X is given by
  Var(X) = σ² = ∫_{−∞}^{∞} (x − μ)² f(x) dx
              = ∫_{−∞}^{∞} x² f(x) dx − μ²

5 If X is a continuous random variable, then E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx


6 You can calculate Var(X) using
Var(X) = E(X²) − (E(X))²


7 In the case where g(X) is a linear function of the form aX + b,
• E(aX + b) = aE(X) + b
• Var(aX + b) = a²Var(X)


8 The mode of a continuous random variable is the value of x for which the p.d.f. is a maximum. 


9 If X is a continuous random variable with c.d.f. F(x):
• the median of X is the value m such that F(m) = 0.5
• the lower quartile of X is the value Q₁ such that F(Q₁) = 0.25
• the upper quartile of X is the value Q₃ such that F(Q₃) = 0.75
• the nth percentile of X is the value Pₙ such that F(Pₙ) = n/100


10 - Positive skew: mode < median < mean 


- Negative skew: mean < median < mode 


11 A random variable having a continuous uniform distribution U[a, b] has p.d.f.

f(x) = 1/(b − a)    a ≤ x ≤ b
       0            otherwise


12 For a continuous uniform distribution U[a, b]

E(X) = (a + b)/2

Var(X) = (b − a)²/12

         0                  x < a
F(x) =   (x − a)/(b − a)    a ≤ x ≤ b
         1                  x > b
Combinations of random variables


After completing this chapter you should be able to:


• Find the distribution of linear combinations of normal random variables → pages 93-98


• Solve modelling problems involving combinations of normal random variables → pages 95-98


Prior knowledge check 


1 A random variable Y ~ N(20, 4²). Find:
a P(Y > 15) b P(19 < Y < 22) ← SM2, Chapter 3

2 Given that X ~ N(0, 1²), find:
a p such that P(X > p) = 0.25
b q such that P(|X| < q) = 0.7 ← SM2, Chapter 3

3 X is a random variable with E(X) = 7 and Var(X) = 3. Find:
a E(2X) b Var(3X − 5)
c E(1 − 6X) ← FS1, Chapter 1

If the times of the first- and second-place sprinters are normally distributed, then
the winning margin will also be normally distributed. → Exercise 4A, Q9



4.1 Combinations of random variables


Two random variables are independent if the outcome of one does not affect the distribution of the 
other. You need to be able to combine random variables with different distributions. You will use these 
two results: 


■ If X and Y are two random variables, then
• E(X + Y) = E(X) + E(Y)
• E(X − Y) = E(X) − E(Y)
■ If X and Y are two independent random variables, then
• Var(X + Y) = Var(X) + Var(Y)
• Var(X − Y) = Var(X) + Var(Y)

Note: You do not need to be able to prove these results for your exam.
Watch out: You add the variances even when you subtract the random variables.


If X is a random variable with E(X) = μ₁ and Var(X) = σ₁², and Y is an independent random
variable with E(Y) = μ₂ and Var(Y) = σ₂², find the mean and variance of:


a X + Y b X − Y

a E(X + Y) = E(X) + E(Y)
           = μ₁ + μ₂
  Var(X + Y) = Var(X) + Var(Y)
             = σ₁² + σ₂²
b E(X − Y) = E(X) − E(Y)
           = μ₁ − μ₂
  Var(X − Y) = Var(X) + Var(Y)
             = σ₁² + σ₂²

You can combine the above result with standard results about expectations and variances of 
multiples of a random variable to analyse linear combinations of independent random variables. 


■ If X and Y are two random variables, then
• E(aX + bY) = aE(X) + bE(Y)
• E(aX − bY) = aE(X) − bE(Y)
■ If X and Y are two independent random variables, then
• Var(aX + bY) = a²Var(X) + b²Var(Y)
• Var(aX − bY) = a²Var(X) + b²Var(Y)
In this chapter you will apply these results to analyse normally distributed random variables.
■ A linear combination of normally distributed random variables is also normally distributed.
This result allows you to fully define the distribution of a linear combination of independent normal
random variables.
■ If X and Y are independent random variables with X ~ N(μ₁, σ₁²) and Y ~ N(μ₂, σ₂²), then

• aX + bY ~ N(aμ₁ + bμ₂, a²σ₁² + b²σ₂²)
• aX − bY ~ N(aμ₁ − bμ₂, a²σ₁² + b²σ₂²)


You can also use this result to find the distribution of sums of identically distributed independent
normal random variables.

■ If X₁, X₂, ..., Xₙ are independent identically distributed random variables with
Xᵢ ~ N(μ, σ²), then Σ_{i=1}^{n} Xᵢ ~ N(nμ, nσ²)

Watch out: The random variable Σ_{i=1}^{n} Xᵢ is not the same as the random variable nX₁.
For example, Var(X₁ + X₂) = σ² + σ² = 2Var(X₁), but Var(2X₁) = 4Var(X₁).
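The difference between a sum of independent observations and a multiple of one observation can be seen with Python's standard-library statistics.NormalDist, which supports adding independent normal variables and multiplying by a constant. This is a sketch for illustration: the values μ = 10 and σ = 3 are arbitrary choices, not taken from the text.

```python
from statistics import NormalDist

# Illustrative values: mu = 10, sigma = 3, so sigma^2 = 9.
X1 = NormalDist(mu=10, sigma=3)
X2 = NormalDist(mu=10, sigma=3)  # a second, independent observation

total = X1 + X2   # NormalDist treats the two operands as independent
double = 2 * X1   # one observation, doubled

# Same mean (20), but variances of about 2 * 9 = 18 and 4 * 9 = 36 respectively.
print(total.mean, total.variance)
print(double.mean, double.variance)
```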


The independent random variables X and Y have distributions X ~ N(5, 2²) and Y ~ N(10, 3²).
a Find the distribution of:
i A = X + Y ii B = 9X − 2Y
b Find P(B > 30).
The independent random variables Y₁, Y₂, Y₃ and Y₄ all have the same distribution as Y.
The random variable Z is defined as:
Z = Σ_{i=1}^{4} Yᵢ


c Find the mean and standard deviation of Z. 


a i A ~ N(5 + 10, 2² + 3²)
So A ~ N(15, 13)

ii B ~ N(9 × 5 − 2 × 10, 9² × 2² + 2² × 3²)
So B ~ N(25, 360)

Watch out: When you subtract random variables you still add the variances.


b P(B > 30) = 0.3961


c Z ~ N(4 × 10, 4 × 3²)
E(Z) = 40, Var(Z) = 36
So Z has mean 40 and standard
deviation 6.
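The answers in this example can be checked with Python's standard-library statistics.NormalDist. This is a sketch for checking purposes rather than an exam method. Note that NormalDist takes the standard deviation, not the variance, as its second argument.

```python
from statistics import NormalDist

X = NormalDist(5, 2)    # X ~ N(5, 2^2)
Y = NormalDist(10, 3)   # Y ~ N(10, 3^2)

A = X + Y               # mean 15, variance 13
B = 9 * X - 2 * Y       # mean 25, variance 360
p_b = 1 - B.cdf(30)     # P(B > 30)

# Z is the sum of four independent copies of Y, which is not the same as 4 * Y.
Z = NormalDist(4 * Y.mean, (4 * Y.variance) ** 0.5)

print(A.mean, A.variance)
print(B.mean, B.variance)
print(p_b)
print(Z.mean, Z.stdev)
```

Writing 4 * Y instead would give a variance of 16 × 9 = 144 rather than 36.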


The independent random variables X and Y have distributions X ~ N(25, 6) and Y ~ N(22, 10).
Find P(X > Y).


Let C = X − Y
Then P(X > Y) = P(C > 0)
C ~ N(25 − 22, 6 + 10) so C ~ N(3, 4²)
P(C > 0) = 0.7734 so P(X > Y) = 0.7734 (4 s.f.)

Problem-solving: You can compare independent normal random variables by defining a new random
variable to be the difference between them. If X − Y > 0 then X > Y, and if X − Y < 0 then Y > X.
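The same comparison can be sketched in Python with the standard-library statistics.NormalDist. Here 6 and 10 are variances, so their square roots are passed as the standard deviations. The variable names are choices made for this illustration.

```python
from math import sqrt
from statistics import NormalDist

X = NormalDist(25, sqrt(6))    # X ~ N(25, 6): the second parameter is a variance
Y = NormalDist(22, sqrt(10))   # Y ~ N(22, 10)

C = X - Y                      # difference of independent normals
p_x_greater = 1 - C.cdf(0)     # P(X > Y) = P(C > 0)

print(C.mean, C.variance, p_x_greater)
```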


Bottles of mineral water are delivered to shops in crates containing 12 bottles each. The weights of
bottles are normally distributed with mean weight 2 kg and standard deviation 0.05 kg. The weights
of empty crates are normally distributed with mean 2.5 kg and standard deviation 0.3 kg.


a Assuming that all random variables are independent, find the probability that a full crate will
weigh between 26 kg and 27 kg.


b Two bottles are selected at random from a crate. Find the probability that they differ in
weight by more than 0.1 kg.


c Find the weight, M, that a full crate should have on its label so that there is a 1% chance that it
will weigh more than M.


a Let W = X₁ + X₂ + ... + X₁₂ + C
where Xᵢ ~ N(2, 0.05²) and C ~ N(2.5, 0.3²):
E(W) = 12E(X) + E(C)
     = (12 × 2) + 2.5
     = 26.5
Var(W) = 12Var(X) + Var(C)
       = (12 × 0.05²) + 0.3²
       = 0.12
W ~ N(26.5, 0.12)
P(26 < W < 27) = 0.8511 (4 s.f.)


b Let Y = X₁ − X₂
E(Y) = E(X₁) − E(X₂) = 0
Var(Y) = Var(X₁) + Var(X₂)
       = 0.05² + 0.05² = 0.005
So Y ~ N(0, 0.005)
P(|Y| > 0.1) = 1 − P(−0.1 < Y < 0.1)
             = 1 − 0.8427
             = 0.1573 (4 s.f.)


c Find M such that P(W > M) = 0.01
So P(W < M) = 0.99
M = 27.3 kg (3 s.f.)

Exercise 4A


1 Given the random variables X ~ N(80, 3²) and Y ~ N(50, 2²) where X and Y are independent,
find the distribution of W where:
a W = X + Y b W = X − Y


2 Given the random variables X ~ N(45, 6), Y ~ N(54, 4) and W ~ N(49, 8) where X, Y and W
are independent, find the distribution of R where R = X + Y + W.


3 X and Y are independent normal random variables. X ~ N(60, 25) and Y ~ N(50, 16).
Find the distribution of T where:
a T = 3X b T = 7Y c T = 3X + 7Y d T = X − 2Y


4 X, Y and W are independent normal random variables. X ~ N(8, 2), Y ~ N(12, 3) and
W ~ N(15, 4). Find the distribution of A where:

a A = X + Y + W b A = W − X c A = X − Y + 3W

d A = 3X + 4W e A = 2X − Y + W

5 A, B and C are independent normal random variables. A ~ N(50, 6), B ~ N(60, 8) and
C ~ N(80, 10). Find:

a P(A + B < 115) b P(A + B + C > 198) c P(B + C < 138)

d P(2A + B − C < 70) e P(A + 3B − C > 140) f P(105 < A + B < 116)
6 Given the random variables X ~ N(20, 5) and Y ~ N(10, 4) where X and Y are independent, find:
a E(X − Y) (2 marks)
b Var(X − Y) (2 marks)
c P(13 < X − Y < 16) (2 marks)
7 X and Y are independent random variables with X ~ N(76, 15) and Y ~ N(80, 10). Find:

a P(Y > X) b P(X > Y)

c the probability that X and Y differ by i less than 3 ii more than 7

8 The random variable R is defined as R = X + 4Y where X ~ N(8, 2²), Y ~ N(14, 3²) and


X and Y are independent. Find: 


a E(R) (2 marks) 
b Var(R) (2 marks) 
c P(R<41) (2 marks) 


The random variables Y,, Y, and Y; are independent and each has the same distribution as Y. 
3 
The random variable S is defined as sS=))¥-> 4X 
i=l 
d Find Var(S). (2 marks) 


9 Two runners recorded the mean and standard deviation of their 100 m sprint times in a table.

         | Mean         | Standard deviation
Runner A | 13.2 seconds | 0.9 seconds
Runner B | 12.9 seconds | 1.3 seconds


a Assuming that each runner’s times are normally distributed, find the probability that in a 


head-to-head race, runner A will win by more than 0.5 seconds. (5 marks) 
A ‘photo finish’ occurs if the winning margin is less than 0.1 seconds. 
b Find the probability of a ‘photo finish’. (2 marks) 
10 A factory makes steel rods and steel tubes. The diameter of a steel rod is normally distributed


with mean 3.55cm and standard deviation 0.02 cm. The internal diameter of a steel tube is 
normally distributed with mean 3.60 cm and standard deviation 0.02 cm. 

A rod and a tube are selected at random. Find the probability that the rod cannot pass 

through the tube. (6 marks) 


11 The mass of a randomly selected jar of jam is normally distributed with a mean mass of 1 kg
and a standard deviation of 12 g. The jars are packed in boxes of 6 and the mass of the box is 
normally distributed with mean mass 250 g and standard deviation 10g. Find the probability 
that a randomly chosen box of 6 jars will have a mass less than 6.2 kg. (6 marks) 




12 The thickness of paperback books can be modelled as a normal random variable with mean
2.1 cm and variance 0.39 cm². The thickness of hardback books can be modelled as a normal
random variable with mean 4.0 cm and variance 1.56cm?. A small bookshelf is 30cm long. 


a Find the probability that a random sample of 

i 15 paperback books can be placed side-by-side on the bookshelf 

ii 5 hardback and 5 paperback books can be placed side-by-side on the bookshelf. (8 marks) 
b Find the shortest length of bookshelf needed so that there is at least a 99% chance 

that it will hold a random sample of 15 paperback books. (3 marks) 


13 A sweet manufacturer produces two varieties of fruit sweet, Xtras and Yummies. The masses, X
and Y in grams, of randomly selected Xtras and Yummies are such that
X ~ N(30, 25) and Y ~ N(32, 16) 

a Find the probability that the mass of two randomly selected Yummies will differ by more 

than 5g. (5 marks) 
One sweet of each variety is selected at random. 
b Find the probability that the Yummy sweet has a greater mass than the Xtra. (5 marks) 
A packet contains 6 Xtras and 4 Yummies. 


c Find the probability that the average mass of the sweets in the packet lies between 
280 g and 330g. (6 marks) 


14 A certain brand of biscuit is individually wrapped. The mass of a biscuit can be taken to be
normally distributed with mean 75 g and standard deviation 5 g. The mass of an individual 
wrapping is normally distributed with mean 10 g and standard deviation 2 g. Six of these 
individually wrapped biscuits are then packed together. The mass of the packing material is a 
normal random variable with mean 40 g and standard deviation 3 g. Find, to 3 decimal places, 
the probability that the total mass of the packet lies between 535 g and 565 g. (7 marks) 


15 The independent normal random variables X and Y have distributions N(10, 2²) and
N(40, 3) respectively. The random variable Q is defined as


Q = 2X + Y
a Find:
i E(Q) (2 marks)
ii Var(Q) (3 marks)


The random variables X₁, X₂, X₃, X₄, X₅ are independent and all share the same distribution
as X. The random variable R is defined as


R = Σ_{i=1}^{5} Xᵢ
b i Find the distribution of R. ii Find P(Q > R). (7 marks)


16 The usable capacity of the hard drive on a games console is normally distributed with mean
60 GB and standard deviation 2.5 GB. The amounts of storage required by games are modelled 
as being identically normally distributed with mean 5.5 GB and standard deviation 1.2 GB. 


a Chloe wants to save 10 randomly chosen games onto her empty hard drive. 


Find the probability that they will fit. (8 marks) 
b State one assumption you have made in your calculations, and comment on its 
validity. (1 mark) 
(E/P) 17 X₁, X₂, X₃ and X₄ are independent random variables, each with distribution N(4, 0.03).
The random variables Y and Z are defined as


Y = X₁ + X₂ + X₃    Z = 3X₄


Find the probability that Y and Z differ by no more than 1. (5 marks) 


(E/P) 18 A builder purchases bags of sand in two sizes, large and small. Large bags have mass L kg and
small bags have mass S kg. L and S are independent normally distributed random variables
with distributions N(75, 5²) and N(40, 3²) respectively.

A large and a small bag of sand are chosen at random. 

a Find the probability that the mass of the small bag is more than half the mass of the large 
bag. (6 marks) 

The builder purchases 10 small bags of sand. The total mass of these bags is represented by the 

random variable M. 


b Find P(|M − 400| < 5). (5 marks)

Challenge
For independent random variables X and Y, E(XY) = E(X)E(Y).
Use this result to prove that if X and Y are independent random
variables, then Var(X + Y) = Var(X) + Var(Y).

Hint: You may make use of the fact that for any two random variables, E(X + Y) = E(X) + E(Y).


Summary of key points 


1 If X and Y are two random variables, then
• E(X + Y) = E(X) + E(Y)
• E(X − Y) = E(X) − E(Y)
2 If X and Y are two independent random variables, then
• Var(X + Y) = Var(X) + Var(Y)
• Var(X − Y) = Var(X) + Var(Y)
3 If X and Y are two random variables, then
• E(aX + bY) = aE(X) + bE(Y)
• E(aX − bY) = aE(X) − bE(Y)
4 If X and Y are two independent random variables, then
• $\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$
• $\mathrm{Var}(aX - bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$
5 A linear combination of normally distributed random variables is also normally distributed.
6 If X and Y are independent random variables with $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$, then
• $aX + bY \sim N(a\mu_1 + b\mu_2,\ a^2\sigma_1^2 + b^2\sigma_2^2)$
• $aX - bY \sim N(a\mu_1 - b\mu_2,\ a^2\sigma_1^2 + b^2\sigma_2^2)$
7 If $X_1, X_2, \ldots, X_n$ are independent identically distributed random variables with $X_i \sim N(\mu, \sigma^2)$, then
$\sum_{i=1}^{n} X_i \sim N(n\mu, n\sigma^2)$
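
The results for linear combinations of independent normal variables can be sketched in a few lines of Python. This is only an illustration: the helper names `linear_combination` and `sum_of_iid` and the numerical parameters are assumptions made for the example, not part of the text. It uses `statistics.NormalDist` from the standard library, which stores a mean and a standard deviation.

```python
from math import sqrt
from statistics import NormalDist


def linear_combination(a, x, b, y):
    """Distribution of aX + bY for independent normal X and Y.

    Means combine linearly; variances are scaled by a^2 and b^2 and
    always added, even when b is negative.
    """
    mean = a * x.mean + b * y.mean
    variance = a**2 * x.variance + b**2 * y.variance
    return NormalDist(mean, sqrt(variance))


def sum_of_iid(x, n):
    """Distribution of X_1 + ... + X_n for n independent copies of X."""
    return NormalDist(n * x.mean, sqrt(n * x.variance))


# Hypothetical parameters chosen only for illustration.
x = NormalDist(10, 3)  # X ~ N(10, 3^2)
y = NormalDist(4, 2)   # Y ~ N(4, 2^2)

d = linear_combination(2, x, -3, y)  # 2X - 3Y
total = sum_of_iid(x, 4)             # X_1 + X_2 + X_3 + X_4

print(d.mean, d.variance)          # mean 20 - 12, variance 4*9 + 9*4
print(total.mean, total.variance)  # mean 4*10, variance 4*9
print(d.cdf(0))                    # P(2X - 3Y < 0)
```

Note that `sum_of_iid(x, 4)` is not the same distribution as `linear_combination(4, x, 0, y)`: the sum of four independent copies has variance $4\sigma^2$, whereas $4X$ has variance $16\sigma^2$.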
Review exercise 1


A long distance lorry driver recorded the
distance travelled, m miles, and the amount
of fuel used, f litres, each day. Summarised
below are data from the driver’s records for
a random sample of 8 days.

The data are coded such that
x = m − 250 and y = f − 100.
The data collected can be summarised as
follows:
$\sum x = 130$   $\sum y = 48$
$\sum xy = 8880$   $S_{xx} = 20487.5$


a Find the equation of the regression line 
of y on x in the form y =a + bx. (2) 


b Hence find the equation of the
regression line of f on m. (3)
c Predict the amount of fuel used on a
journey of 235 miles. (1)

← Section 1.1


A manufacturer stores drums of 
chemicals. During storage, evaporation 
takes place. A random sample of 10 drums 
was taken and the time in storage, x weeks, 
and the evaporation loss, y ml, are shown 
in the table below. 


x | 3 | 5 | 6 | 8 | 10 | 12 | 13 | 15 | 16 | 18
y | 36 | 50 | 53 | 61 | 69 | 79 | 82 | 90 | 88 | 96

a On graph paper, draw a scatter
diagram to represent these data. (2)
b Give a reason to support fitting a
regression model of the form
y = a + bx to these data. (1)

c Find, to 2 decimal places, the value of
a and the value of b.
(You may use $\sum x^2 = 1352$, $\sum y^2 = 53\,112$ and $\sum xy = 8354$.) (2)


d Give an interpretation of the value 


of b. (1) 


e Using your model, predict the amount 
of evaporation that would take place 
after: 

i 19 weeks 
ii 35 weeks. 


(2) 
f Comment, with a reason, on the 
reliability of each of your predictions. 


(2) 


← Section 1.1

A metallurgist measured the length, l mm,
of a copper rod at various temperatures, 
t°C, and recorded the following results. 


t l 
20.4 2461.12 
27.3 2461.41 
32.1 2461.73 
39.0 2461.88 
42.9 2462.03 
49.7 2462.37 
58.3 2462.69 
67.4 2463.05 


The results were then coded such that
x = t and y = l − 2460.
a Calculate $S_{xy}$ and $S_{xx}$.
(You may use $\sum x^2 = 15965.01$ and $\sum xy = 757.467$) (2)

b Find the equation of the regression line
of y on x in the form y = a + bx. (2)
c Estimate the length of the rod at
40 °C. (1)

d Find the equation of the regression line
of l on t. (2)
e Estimate the length of the rod at
90 °C. (1)
f Comment on the reliability of your
estimate in part e. (1)

← Section 1.1


A student is investigating the relationship 
between the price (y pence) of 100 g of 
chocolate and the percentage (x%) of the 
cocoa solids in the chocolate. 


The following data are obtained 


Chocolate brand | x (% cocoa) | y (pence)
A | 10 | 35
B | 20 | 55
C | 30 | 40
D | 35 | 100
E | 40 | 60
F | 50 | 90
G | 60 | 110
H | 70 | 130

(You may use: $\sum x = 315$, $\sum x^2 = 15\,225$,
$\sum y = 620$, $\sum y^2 = 56\,550$, $\sum xy = 28\,750$)

a Draw a scatter diagram to represent
these data. (2)

b Show that $S_{xy} = 4337.5$ and find $S_{xx}$. (2)


The student believes that a linear 
relationship of the form y = a + bx could 
be used to describe these data. 


c Use linear regression to find the value 
of a and the value of b, giving your 


answers to 3 significant figures. (2) 
d Draw the regression line on your 
diagram. (1) 


The student believes that one brand of 
chocolate is overpriced. 


e Use the scatter diagram to: 
i state which brand is overpriced 
ii suggest a fair price for this brand. 


Give reasons for both your answers. 


(3) 


← Section 1.1

5 A mobile phone operator recorded the
number of minutes used, x, and the
number of text messages sent, y, for a
random sample of 8 customers during a
single month.

x | 250 | 300 | 340 | 360 | 385 | 400 | 450 | 475
y | 72 | 81 | 90 | 94 | 102 | 106 | 115 | 124


The operator suggests that the data can 
be described using the linear regression 
model y = 12.476 + 0.2311x. 


a Calculate the residual values for this 
model. (2) 

b By considering the residuals, explain 
whether a linear model is a suitable 


model for these data. (1) 
c Given that $S_{xx} = 38\,850$, $S_{yy} = 2090$ and
$S_{xy} = 8980$, calculate the residual sum
of squares (RSS). (2)
A second mobile phone operator records 
the number of minutes used and number 
of text messages sent for a random 
sample of its customers. Using a linear 
regression model, the RSS for this sample 
is 18.254. 


d State, with a reason, which sample 
is more closely modelled by a linear 
regression model. (1) 


← Sections 1.1, 1.2


A sociologist recorded the marriage rates 
per 1000 people, m, and divorce rates per 
1000 people, d, from a random sample of 


6 US states. 

The data are summarised below: 
$\sum m = 101.9$   $\sum m^2 = 1768.47$
$\sum d = 49.7$   $\sum d^2 = 430.63$
$\sum md = 868.06$

a Calculate $S_{mm}$ and $S_{md}$. (2)
b Find the equation of the regression
line of d on m. (3)
c Use your equation to estimate the 

divorce rate if the marriage rate is 

153. (1) 


d Calculate the residual sum of squares 

(RSS). (3) 
The table shows the residual for each 
value of m. 


m Residual 
13.7 —0.703 95 
14.4 —0.3474 


16.9 1.468 85 
17.1 0.642 15 


18.6 x 
21.2 —0.552 
e Find the value of x. (2) 


f By considering the signs of the 
residuals, explain whether or not the 
linear regression model is suitable for 


these data. (1) 
← Sections 1.1, 1.2


A random sample of 7 online companies 
was taken. The monthly amount of 
advertising expenditure, x, in £1000s, 
and the monthly sales, y, in £1000, was 
recorded. The results were as follows: 


x | 1.4 | 1.5 | 2.5 | 3.4 | 1.3 | 2.2 | 1.8
y | 370 | 440 | 660 | 950 | 330 | 550 | 720

(You may use $S_{xx} = 3.388\,571$,
$S_{yy} = 289\,771.4$, $S_{xy} = 895.5714$)

a Find the equation of the regression 
line of y on x in the form y =a + bx as 
a model for these results. (2) 


b Show that the residual sum of squares 
is 53 100, correct to three significant 


figures. (2) 
c Calculate the residual values. (2) 
d Write down the outlier. (1) 
e Comment on the validity of ignoring 

this outlier. (1) 
f Ignoring the outlier, produce another 

model. (2) 


g Use this model to estimate the monthly 
sales for a company with advertising 
expenditure of £1800. (1) 




h Comment, giving a reason, on the 


reliability of your estimate. (1) 
← Sections 1.1, 1.2


8 An anthropologist uses a linear regression
model to predict the height, h cm, of a
male humanoid from the length, l cm, of
the femur bone. She collects data from
8 skeletons.

l h
50.2 178.6 
48.4 173.7 
45.3 164.9 
44.8 163.8 
44.6 168.4 
42.8 165.1 
39.6 155.5 
38.1 155.1 
The data collected can be summarised as
follows:
$\sum l = 353.8$   $\sum h = 1325.1$
$\sum l^2 = 15\,762.5$   $\sum h^2 = 219\,944.9$
$\sum lh = 58\,825.04$

a Calculate the equation of the
regression line of h on l, giving your
answer in the form h = a + bl. (3)


b Use your regression line to predict the 
height of a male humanoid with femur 


length 45.1 cm. (2) 
The table shows the residuals for each 
value of l.

l Residual
50.2 1.47138
48.4 0.03296
45.3 −2.80543
44.8 −2.94388
44.6 2.04074
42.8 p
39.6 −1.24376
38.1 1.24089
c Find the value of p. (2)


d By considering the signs of the 
residuals, or otherwise, comment on 
the suitability of the linear regression 


model for these data. (1) 
e Calculate the residual sum of squares 

(RSS). (2) 
An equivalent random sample of female 
humanoids is taken and the residual sum 
of squares is found to be 25.467. 


f State, with a reason, which sample is 
likely to have the best linear fit. (1) 


← Sections 1.1, 1.2


The scatter diagrams below were drawn 
by a student. 


[Figure: three scatter diagrams, labelled Diagram A, Diagram B and Diagram C, each drawn on x and y axes]


The student calculated the value of the 
product moment correlation coefficient 
for each of the sets of data. 


The values were: 
0.68 —0.79 0.08 


Write down, with a reason, which value 


corresponds to which scatter diagram. (3) 
← Section 2.1


A young family were looking for a new 
three-bedroom semi-detached house. A 
local survey recorded the price, x, in £1000s, 
and the distance, y, in miles, from the 
nearest railway station of such houses. The 
following summary statistics were provided: 


$S_{xx} = 113\,573$   $S_{yy} = 8.657$
$S_{xy} = 808.917$

a Use these values to calculate the
product moment correlation
coefficient. (2)
b Give an interpretation of your answer 
to part a. (1) 


In another survey, the data for the same 
houses were supplied in km rather than 
miles. 


c State the value of the product moment 


correlation coefficient in this case. (1) 
← Section 2.1


As part of a statistics project, Gill 
collected data relating to the length 

of time, to the nearest minute, spent 

by shoppers in a supermarket and the 
amount of money they spent. Her data 
for a random sample of 10 shoppers are 
summarised in the table below, where 

t represents time and m represents the 
amount spent over £20. 


t (minutes) | m (£)
15 | −3
23 | 17
5 | −19
16 | 4
30 | 12
6 | −9
32 | 27
23 | 6
35 | 20
27 | 6


a Write down the actual amount 
spent by the shopper who was in the 


supermarket for 15 minutes. (1) 
b Calculate $S_{tt}$, $S_{mm}$ and $S_{tm}$.
(You may use $\sum t^2 = 5478$, $\sum m^2 = 2101$
and $\sum tm = 2485$) (3)


c Calculate the value of the product 
moment correlation coefficient between 
tand m. (2) 


d Write down the value of the product 
moment correlation coefficient between 
t and the actual amount spent. Give a 


reason to justify your value. (1) 
On another day Gill collected similar 
data. For these data, the product moment 


correlation coefficient was 0.178. 


e Give an interpretation to both of these 
coefficients. (2) 


f Suggest a practical reason why these 
two values are so different. 


(1) 


← Section 2.1


During a village show, two judges, P 
and Q, had to award a mark out of 

30 to some flower displays. The marks 
they awarded to a random sample of 8 
displays are shown in the table below. 


Display | A | B | C | D | E | F | G | H
Judge P | 25 | 19 | 21 | 23 | 28 | 17 | 16 | 20
Judge Q | 20 | 9 | 21 | 13 | 17 | 14 | 11 | 15


a Calculate Spearman’s rank correlation 
coefficient for the marks awarded by 
the two judges. (4) 

After the show, one competitor 

complained about the judges. She claimed 

that there was no positive correlation 


between their marks. 


b Stating your hypotheses clearly, test 
whether this sample provides support 
for the competitor’s claim. 


Use a 5% level of significance. (4) 
← Sections 2.2, 2.3

The table below shows the price of the
same ice lolly at different stands on a
beach, and the distance of each stand
from the pier.

Stand | Distance from pier (m) | Price (£)
A | 50 | 1.75
B | 175 | 1.20
C | 270 | 2.00
D | 375 | 1.05
E | 425 | 0.95
F | 580 | 1.25
G | 710 | 0.80
H | 790 | 0.75
I | 890 | 1.00
J | 980 | 0.85


a Find, to 3 decimal places, the Spearman 
rank correlation coefficient between the 
distance of the stand from the pier and 
the price of the ice lolly. (4) 


b Stating your hypotheses clearly and 
using a 5% significance level, test for 
negative rank correlation between price 
and distance. (4) 

← Sections 2.2, 2.3


The numbers of deaths in one year from 
pneumoconiosis and lung cancer in a 
developing country are given in the table 
below. 


Age group (years) | Deaths from pneumoconiosis (1000s) | Deaths from lung cancer (1000s)
20-29 | 12.5 | 3.7
30-39 | 5.9 | 9
40-49 | 18.5 | 10.2
50-59 | 19.4 | 19
60-69 | 31.2 | 13
70 and over | 31 | 18


A charity claims that the relative 
vulnerabilities of different age groups are 
similar for both diseases. 


a Give one reason to support the use of
Spearman’s rank correlation coefficient
in this instance. (1) 


b Calculate Spearman’s rank correlation 
coefficient for these data. (4) 




c Test the charity’s claim at the 
5% significance level. State your 
hypotheses clearly. (4) 


← Sections 2.2, 2.3

The product moment correlation
coefficient is denoted by r and Spearman’s
rank correlation coefficient is denoted
by $r_s$.
a Sketch separate scatter diagrams, with
five points on each diagram, to show:
i $r = 1$
ii $r_s = -1$ but $r > -1$
Two judges rank seven collie dogs in a
competition. The collie dogs are labelled
A to G and the rankings are as follows.

Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7
Judge 1 | A | C | D | B | E | F | G
Judge 2 | A | B | D | C | E | G | F


b i Calculate Spearman’s rank 
correlation coefficient for these data. 
ii Stating your hypotheses clearly, 
test, at the 5% level of significance, 
whether or not the judges are 


generally in agreement. (8) 
← Sections 2.1, 2.2, 2.3


The masses of a reactant t mg and a 
product p mg in ten different instances of 
a chemistry experiment were recorded in 
a table. 


t p
1.2 3.8 
1.9 7 
3.2 11 
3.9 12 
2.5 9 
4.5 12 
5.7 13.5 
4 12.2 
1.1 2 
5.9 13.9 


(You may use $\sum t^2 = 141.51$,
$\sum p^2 = 1081.74$ and $\sum tp = 386.32$)

a Draw a scatter diagram to represent
these data. (2)
b State what is measured by the product
moment correlation coefficient. (1)
c Calculate $S_{tt}$, $S_{pp}$ and $S_{tp}$. (3)
d Calculate the value of the product
moment correlation coefficient r
between t and p. (2)


e Stating your hypotheses clearly, test, 
at the 1% significance level, whether or 
not the correlation coefficient is greater 
than zero. (4) 


f With reference to your scatter diagram, 


comment on your result in part e. (1)
← Sections 2.1, 2.3


A geographer claims that the speed of 
flow of water in a river gets slower the 
wider the river is. He measures the width 
of the river, w metres, at seven points and 
records the rate of flow, f m s$^{-1}$.

Point | A | B | C | D | E | F | G
w | 1.3 | 1.8 | 2.2 | 3.1 | 4.8 | 5.2 | 7.3
f | 5.4 | 4.8 | 4.9 | 4.4 | 3.8 | 3.9 | 2.5

The Spearman’s rank correlation
coefficient between w and f is −0.93.


a Stating your hypotheses clearly, test 
whether or not the data provides 
support for the geographer’s claim. 
Test at the 1% level of significance. (4) 


b Without recalculating the correlation 
coefficient, explain how the Spearman’s 
rank correlation coefficient would 
change if: 

i the speed of flow at G was actually
2.6 m s$^{-1}$

ii an extra measurement, H, was taken
with a width of 0.8 m and a speed of
flow of 6.2 m s$^{-1}$. (3)


The geographer collected data from a 
further 10 locations and found that there 
were now many tied ranks. 
c Describe how you could find
Spearman’s rank correlation coefficient
in this situation. (2)
← Sections 2.2, 2.3

A continuous random variable X has
probability density function

$f(x) = \begin{cases} k(4x - x^3) & 0 \le x \le 2 \\ 0 & \text{otherwise} \end{cases}$

where k is a positive constant.

a Show that $k = \frac{1}{4}$ (3)
b Sketch f(x). (2)
Find:
c E(X) (3)
d the mode of X (2) 
e the median of X (3) 
f Comment on the skewness of the 
distribution. (2) 


← Sections 3.1, 3.2, 3.3, 3.4

A continuous random variable X has
probability density function f(x) where

$f(x) = \begin{cases} kx(x - 2) & 2 \le x \le 3 \\ 0 & \text{otherwise} \end{cases}$

and k is a positive constant.
a Show that $k = \frac{3}{4}$ (3)
b Given that $E(X) = \frac{43}{16}$, find Var(X). (4)
c Find the cumulative distribution 


function F(x). (4) 
d Show that the median value of X lies 
between 2.70 and 2.75. (3) 


← Sections 3.1, 3.2, 3.3, 3.4

Ben attempts to model the continuous
random variable Y with the cumulative
distribution function:

$F_Y(y) = \begin{cases} 0 & y < 1 \\ 13y - 4y^2 - 9 & 1 \le y \le 2 \\ 1 & y > 2 \end{cases}$

a Explain what is wrong with Ben’s
model.


Ben adapts his model to use the following 
cumulative distribution function. 


$F_Y(y) = \begin{cases} 0 & y < 1 \\ k(y^4 + y^2 - 2) & 1 \le y \le 2 \\ 1 & y > 2 \end{cases}$

Using Ben’s second model,
b show that $k = \frac{1}{18}$ (3)
c find P(Y > 1.5) (2) 


d specify fully the probability density 
function f(y). (3) 


← Sections 3.1, 3.2

The continuous random variable X has
probability density function f(x) given by

$f(x) = \begin{cases} 2(x - 2) & 2 \le x \le 3 \\ 0 & \text{otherwise} \end{cases}$

a Sketch f(x) for all values of x. (2)
b Write down the mode of X. (1)
c Given that $E(X) = \frac{8}{3}$, find Var(X). (3)
d Find the median of X. (3)
e Comment on the skewness of this 


distribution. Give a reason for your 
answer. (2) 


← Sections 3.1, 3.2, 3.3, 3.4

The continuous random variable X has
probability density function given by

$f(x) = \begin{cases} \frac{1}{6}x & 0 \le x < 3 \\ 2 - \frac{1}{2}x & 3 \le x \le 4 \\ 0 & \text{otherwise} \end{cases}$


a Sketch the probability density function 


of X. (3) 
b Find the mode of X. (1) 
c Specify fully the cumulative 

distribution function of X. (4) 


d Using your answer to part c, find the
median of X. (3) 

e Find the 10th to 90th percentile range, 
giving your answer correct to three 
decimal places. (4) 


← Sections 3.1, 3.2, 3.4

The continuous random variable X has
cumulative distribution function

$F(x) = \begin{cases} 0 & x < 0 \\ 2x^2 - x^3 & 0 \le x \le 1 \\ 1 & x > 1 \end{cases}$


a Find P(X > 0.3). (2) 
b Verify that the median value of X lies 
between x = 0.59 and x = 0.60. (3) 


c Find the probability density function 


f(x). (3) 
d Evaluate E(X). (3) 
e Find the mode of X. (2) 
f Comment on the skewness of X. 

Justify your answer. (2) 


← Sections 3.1, 3.2, 3.3, 3.4

The continuous random variable X has
probability density function given by 
k O0<x<2 
f(x) = 7 2<x<4 
0 otherwise 
a Show that k = 5 77 5 (4) 
b Find E(X). (3)

← Sections 3.1, 3.3


The random variable X is uniformly 
distributed over the interval [-1,5]. 


a Sketch the probability density function 


f(x) of X. (2) 
Find: 
b E(X) (2) 
c Var(X) (2)
d P(−0.3 < X < 3.3) (2)

← Section 3.5

The continuous random variable X is

uniformly distributed over the interval 
(2, 6). 

a Write down the probability density 


function f(x). (2) 
Find: 
b E(X) (2) 


c Var(X) (2)
d the cumulative distribution function of
X, for all x (2)

e P(2.3 < X < 3.4) (2)
← Section 3.5


A string AB of length 5 cm is cut, in a
random place C, into two pieces. The 
random variable X is the length of AC. 


a Write down the name of the 
probability distribution of X and 
sketch the graph of its probability 
density function. 


(3) 
b Find the values of E(X) and Var(X). (4)
c Find P(X > 3). (2)

d Write down the probability that AC is
exactly 3 cm long. (1)
← Sections 3.5, 3.6

The continuous random variable X is
uniformly distributed over the interval
$\alpha \le x \le \beta$.


a Write down the probability density 
function of X, for all x. 


(2) 

b Given that E(X) = 2 and P(X < 3) =2, 
find the value of $\alpha$ and the value of $\beta$.

(3) 


← Sections 3.5, 3.6

A gardener has wire cutters and a piece
of wire 150 cm long which has a ring
attached at one end. The gardener cuts 
the wire, at a randomly chosen point, into 
2 pieces. The length, in cm, of the piece 
of wire with the ring on it is represented 


by the random variable X. 

Find: 

a E(X) (2) 
b the standard deviation of X (2)


c the probability that the shorter piece of 


wire is at most 30cm long. (2) 
← Sections 3.5, 3.6

30 At a funfair, the duration B seconds of
a ride on the Big Dipper has the normal
distribution $N(82, 3^2)$. The duration F of
a ride on the Ferris Wheel has the normal
distribution $N(238, 7^2)$. Alice rides on the
Big Dipper and the Ferris Wheel.


a Find the probability that her ride on 
the Ferris Wheel is less than three 
times as long as her ride on the Big 


Dipper. (6) 
b State one assumption you have made 
and comment on its validity. (1) 


Paul rides on the Big Dipper three 

times in a row. The random variable D 
represents the total duration of the three 
rides. 


c Find the distribution of D. (3) 


Given that Alice starts a ride on the 
Ferris Wheel at the same time as Paul 
starts his three rides on the Big Dipper, 


d find the probability that Alice and 
Paul’s rides finish within 10 seconds of 


one another. (5) 
← Section 4.1

A workshop makes two types of electrical
resistor.

The resistance, X ohms, of resistors of
Type A is such that X ~ N(20, 4).

The resistance, Y ohms, of resistors of
Type B is such that Y ~ N(10, 0.84).

When a resistor of each type is connected
into a circuit, the resistance R ohms of
the circuit is given by R = X + Y where X
and Y are independent.


Find: 

a E(R) (2) 
b Var(R) (2) 
c P(28.90 < R < 32.64) (3) 


← Section 4.1

The weights of adult men are normally
distributed with a mean of 84 kg and a
standard deviation of 11 kg.


a Find the probability that the total 
weight of 4 randomly chosen adult 
men is less than 350 kg. (3) 


The weights of adult women are normally 
distributed with a mean of 62kg and a 
standard deviation of 10kg. 


b Find the probability that the weight of 
a randomly chosen adult man is less 
than one and a half times the weight of 


a randomly chosen adult woman. (4)
← Section 4.1


The random variable D is defined as
D = A − 3B + 4C
where $A \sim N(5, 2^2)$, $B \sim N(7, 3^2)$ and
$C \sim N(9, 4^2)$, and A, B and C are
independent.

a Find P(D < 44). (4)

The random variables $B_1$, $B_2$ and $B_3$
are independent and each has the same
distribution as B.

The random variable X is defined as
$X = A - \sum_{i=1}^{3} B_i + 4C$.

b Find P(X > 0). (4)

← Section 4.1


A manufacturer produces two flavours 

of soft drink, cola and lemonade. The 

weights, C and L, in grams, of randomly 

selected cola and lemonade cans are such 

that C ~ N(350, 8) and L ~ N(345, 17). 

a Find the probability that the weights 
of two randomly selected cans of cola 
will differ by more than 6g. (4) 

One can of each flavour is selected at 

random. 

b Find the probability that the can of 
cola weighs more than the can of 
lemonade. (3) 

Cans are delivered to shops in boxes of 

24 cans. The weights of empty boxes are 

normally distributed with mean 100 g and 

standard deviation 2 g. 


c Find the probability that a full box of 
cola cans weighs between 8.51 kg and 


8.52 kg. (4) 
d State an assumption you made in your 
calculation in part c. (1) 


← Section 4.1


Challenge 


1 The table shows data that was collected from a 
scientific experiment: 


ay 1 3 4 5 if 8 
y 15 | 33 || 53 || te | was || ile 


a Use your calculator to find the following 
regression models for these data: 
i Linear (y = a + bx)
ii Quadratic ($y = a + bx + cx^2$)
iii Exponential ($y = ae^{bx}$ or $y = ab^x$)

b By calculating the residuals for each model,
determine which model is most suitable.
← Sections 1.1, 1.2


2 Acontinuous random variable X has probability 
density function 


fae (ocr x >So 
0 otherwise 
a Show that k= 1. 


b Find the cumulative distribution function, 
F(X). 

c Hence, or otherwise, find the exact value of 
EXC << Aie<< “A), € Sections 3.1, 3.2 
3 Let $X_1, X_2, X_3, \ldots, X_n$ be identically distributed
independent random variables, each with the
continuous uniform distribution on [0, 1]. Define
the random variable Y as the maximum value
taken by the $X_i$.

a Show that $E(Y) = \frac{n}{n+1}$.
b Find an expression for the median of Y in
terms of n.

If X and Y are independent continuous random
variables with probability density functions
f(x) and g(y) respectively, then the probability
density function of Z = X + Y is given by

$h(z) = \int_{-\infty}^{\infty} f(z - x)\,g(x)\,dx$

Find and sketch the probability density
function of
c $X_1 + X_2$
d $X_1 + X_2 + X_3$
← Section 3.5


Estimation, confidence 
intervals and tests using 
a normal distribution 


After completing this chapter you should be able to:

● Understand and use estimates and estimators
● Understand bias
● Find the standard error
● Calculate and use confidence intervals for population parameters
● Carry out hypothesis tests for the difference between the means of two
normally distributed random variables with known variances
● Carry out hypothesis tests using large sample results in cases where the
population variance is unknown
Prior knowledge check

1 The independent normal random variables A and B have
distributions $N(6, 2^2)$ and $N(7, 3^2)$ respectively.
a Find P(A > B).
The random variable X is defined as X = 3A + B.
b Find the distribution of X. ← Chapter 4

2 A sample of size 8 is taken from a population with distribution
$N(6, 3^2)$, and the sample mean, $\bar{X}$, is calculated.
Find $P(\bar{X} > 7)$. ← FS1, Chapter 5

3 A random sample of size 20 is taken from a population having a
normal distribution with mean $\mu$ and standard deviation 1.5.
The sample mean is 21.
Test the following hypotheses at the 1% level of significance:
$H_0: \mu = 20$, $H_1: \mu > 20$ ← SM2, Chapter 3

In large-scale production processes it might be impossible to test
every component. Engineers use samples to determine ranges of
values that are likely to contain population parameters such as the
mean or variance. → Mixed exercise Q13

Chapter 5 


5.1 Estimators, bias and standard error


If you have a large population, for example the population of students in a school, or the population
of trees in a forest, it is too time consuming and often costly to carry out a census to record, for
example, the heights of all pupils or trees. In cases like these, population parameters such as the
mean, $\mu$, or the standard deviation, $\sigma$, are likely to be unknown.


In your A level course, you looked at methods of
sampling that allow you to take a representative
sample that can be used to estimate various
population parameters.

Links: A census observes every member of a
population, whereas a sample is a selection
of observations taken from a subset of the
population. There are various ways of selecting a
random sample in practice. ← SM1, Chapter 1

A common way of estimating population
parameters is to take a random sample from
the population.


■ If X is a random variable then a random sample of size n will consist of n observations of the
random variable X which are referred to as $X_1, X_2, X_3, \ldots, X_n$, where the $X_i$

• are independent random variables
• each have the same distribution as X.

■ A statistic, T, is defined as a random variable
consisting of any function of the $X_i$ that involves
no other quantities, such as unknown population
parameters.

Notation: $X_i$ represents the ith
observation of a sample. The specific
value of the observation is denoted by $x_i$.

For example, $\bar{X}$, the sample mean, is a statistic whereas $\sum_{i=1}^{n} \frac{X_i^2}{n} - \mu^2$ is not a statistic since it involves
the unknown population parameter $\mu$.


Example 1

A sample, $X_1, X_2, \ldots, X_n$, is taken from a population with unknown population parameters $\mu$ and $\sigma$.
State whether or not each of the following are statistics.

a $\frac{X_1 + X_3 + X_5}{3}$   b $\max(X_1, X_2, \ldots, X_n)$   c $\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2$

a $\frac{X_1 + X_3 + X_5}{3}$ is a statistic. It is only a function of the sample $X_1, X_2, \ldots, X_n$. A statistic need not involve all members of the sample.
b $\max(X_1, X_2, \ldots, X_n)$ is a statistic. It is only a function of the sample $X_1, X_2, \ldots, X_n$.
c $\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2$ is not a statistic. The function contains $\mu$ and $\sigma$.


Since it is possible to repeat the process of taking a sample, the specific value of a statistic T,
namely t, will generally differ from sample to sample. If all possible samples are taken, these values will form a
probability distribution called the sampling distribution of T.


■ The sampling distribution of a statistic T is the probability distribution of T.


If the distribution of the population is known, then the sampling distribution of a statistic can 
sometimes be found. 


Example 2

The weights, in grams, of a consignment of apples are normally distributed with a mean $\mu$ and
standard deviation 4. A random sample of size 25 is taken and the statistics R and T are calculated
as follows:

$R = X_{25} - X_1$ and $T = X_1 + X_2 + \ldots + X_{25}$
Find the distributions of R and T.

The sample will be $X_1, X_2, \ldots, X_{25}$ where each $X_i \sim N(\mu, 4^2)$. (State the distribution for each of the $X_i$.)

Now $R = X_{25} - X_1 \Rightarrow R \sim N(\mu - \mu,\ 4^2 + 4^2)$,
that is $R \sim N(0, (4\sqrt{2})^2)$.
(E(X − Y) = E(X) − E(Y). Since each observation in a random sample is independent, you can also
use Var(X − Y) = Var(X) + Var(Y). ← Chapter 4)

$T = X_1 + X_2 + \ldots + X_{25}$
so $T \sim N(25\mu,\ 25 \times 4^2)$
or $T \sim N(25\mu,\ 20^2)$
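
The distributions found for R and T in the apples example can be checked by simulation. The sketch below is not part of the original solution: the value `MU = 100.0` is an assumed stand-in for the unknown population mean, and the number of trials and the random seed are arbitrary choices. It repeatedly draws a random sample of size 25, computes R (last observation minus first) and T (the sample total), and compares their simulated means and variances with the theoretical values 0, 32, 25μ and 400.

```python
import random
from statistics import fmean, pvariance

random.seed(0)    # fixed seed so the run is repeatable
MU = 100.0        # assumed value; the population mean is unknown in the example
SIGMA = 4.0       # population standard deviation from the example
N = 25            # sample size from the example
TRIALS = 20_000   # number of simulated samples (arbitrary)

r_values, t_values = [], []
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    r_values.append(sample[-1] - sample[0])  # R = X_25 - X_1
    t_values.append(sum(sample))             # T = X_1 + ... + X_25

r_mean, r_var = fmean(r_values), pvariance(r_values)
t_mean, t_var = fmean(t_values), pvariance(t_values)

print(f"R: mean {r_mean:.3f} (theory 0), variance {r_var:.2f} (theory 32)")
print(f"T: mean {t_mean:.2f} (theory {N * MU}), variance {t_var:.1f} (theory 400)")
```

Each run of the loop plays the role of one repeated sample, so the lists `r_values` and `t_values` are empirical approximations to the sampling distributions of R and T.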


Note: If $X_i \sim N(\mu, \sigma^2)$, then $\sum_{i=1}^{n} X_i \sim N(n\mu, n\sigma^2)$. ← Chapter 4

Example 3

A large bag contains counters. 60% of the counters have the number 0 on them and 40% have the
number 1.

a Find the population mean $\mu$ and population variance $\sigma^2$ of the values shown on the counters.
A simple random sample of size 3 is taken from this population.
b List all the possible observations from this sample.
c Find the sampling distribution for the mean
$\bar{X} = \frac{X_1 + X_2 + X_3}{3}$
where $X_1$, $X_2$ and $X_3$ are the values shown on the three counters in the sample.
d Hence find $E(\bar{X})$ and $\mathrm{Var}(\bar{X})$.
e Find the sampling distribution for the sample mode, M.
f Hence find E(M) and Var(M).
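a If X represents the value shown on a randomly chosen counter, then
X has distribution

x | 0 | 1
P(X = x) | $\frac{3}{5}$ | $\frac{2}{5}$

$\mu = E(X) = \sum x\,P(X = x) = 0 + 1 \times \frac{2}{5} = \frac{2}{5}$
$\sigma^2 = \mathrm{Var}(X) = \sum x^2\,P(X = x) - \mu^2 = 0 + 1^2 \times \frac{2}{5} - \frac{4}{25} = \frac{6}{25}$

b The possible observations are

(0, 0, 0)
(1, 0, 0) (0, 1, 0) (0, 0, 1)
(1, 1, 0) (1, 0, 1) (0, 1, 1)
(1, 1, 1)

c $P(\bar{X} = 0) = \left(\frac{3}{5}\right)^3 = \frac{27}{125}$, i.e. the (0, 0, 0) case

$P(\bar{X} = \frac{1}{3}) = 3 \times \frac{2}{5} \times \left(\frac{3}{5}\right)^2 = \frac{54}{125}$, i.e. the (1, 0, 0), (0, 1, 0), (0, 0, 1) cases

$P(\bar{X} = \frac{2}{3}) = 3 \times \left(\frac{2}{5}\right)^2 \times \frac{3}{5} = \frac{36}{125}$, i.e. the (1, 1, 0), (1, 0, 1), (0, 1, 1) cases

$P(\bar{X} = 1) = \left(\frac{2}{5}\right)^3 = \frac{8}{125}$, i.e. the (1, 1, 1) case.

So the distribution for $\bar{X}$ is

$\bar{x}$ | 0 | $\frac{1}{3}$ | $\frac{2}{3}$ | 1
$p(\bar{x})$ | $\frac{27}{125}$ | $\frac{54}{125}$ | $\frac{36}{125}$ | $\frac{8}{125}$

d $E(\bar{X}) = 0 + \frac{1}{3} \times \frac{54}{125} + \frac{2}{3} \times \frac{36}{125} + 1 \times \frac{8}{125} = \frac{18 + 24 + 8}{125} = \frac{2}{5}$

$\mathrm{Var}(\bar{X}) = 0 + \frac{1}{9} \times \frac{54}{125} + \frac{4}{9} \times \frac{36}{125} + 1 \times \frac{8}{125} - \frac{4}{25} = \frac{6 + 16 + 8}{125} - \frac{20}{125} = \frac{2}{25}$

e The sample mode can take values 0 or 1.
$P(M = 0) = \frac{27}{125} + \frac{54}{125} = \frac{81}{125}$, i.e. cases (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)

and $P(M = 1) = \frac{44}{125}$, i.e. the other cases.

So the distribution of M is

m | 0 | 1
p(m) | $\frac{81}{125}$ | $\frac{44}{125}$

f $E(M) = 0 + 1 \times \frac{44}{125} = \frac{44}{125}$

and $\mathrm{Var}(M) = 0 + 1^2 \times \frac{44}{125} - \left(\frac{44}{125}\right)^2 = 0.228$ (3 s.f.)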
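
The sampling distributions in the counters example can be reproduced by listing every possible sample and adding up the probabilities. The sketch below does this with exact fractions; the function names `sampling_distribution`, `mean_and_variance`, `sample_mean` and `sample_mode` are illustrative choices, not notation from the text.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Population from the example: 60% of counters show 0, 40% show 1.
population = {0: Fraction(3, 5), 1: Fraction(2, 5)}


def sampling_distribution(statistic, n=3):
    """Exact sampling distribution of a statistic over all samples of size n."""
    dist = defaultdict(Fraction)
    for sample in product(population, repeat=n):
        prob = Fraction(1)
        for value in sample:
            prob *= population[value]  # observations are independent
        dist[statistic(sample)] += prob
    return dict(dist)


def mean_and_variance(dist):
    """E(T) and Var(T) for a distribution given as {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    variance = sum(v * v * p for v, p in dist.items()) - mean**2
    return mean, variance


def sample_mean(sample):
    return Fraction(sum(sample), len(sample))


def sample_mode(sample):
    # With three observations taking only the values 0 or 1 there is never a tie.
    return max(set(sample), key=sample.count)


mean_dist = sampling_distribution(sample_mean)
mode_dist = sampling_distribution(sample_mode)

print(sorted(mean_dist.items()))     # values 0, 1/3, 2/3, 1 with their probabilities
print(mean_and_variance(mean_dist))  # E and Var of the sample mean
print(sorted(mode_dist.items()))     # values 0 and 1 with their probabilities
print(mean_and_variance(mode_dist))  # E and Var of the sample mode
```

Working in `Fraction` rather than floating point keeps every probability exact, so the output can be compared directly with the fractions in the worked solution.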
Note: Notice that $E(\bar{X}) = \mu$ but $E(M) \ne \mu$, and that neither $E(\bar{X})$
nor E(M) is equal to the population mode, which
is of course zero as 60% of the counters have a
zero on them.


In your A level course, you calculated the mean and variance of sets of sample data and used these as 
estimates for the equivalent population parameters. 


■ A statistic that is used to estimate a population parameter is called an estimator and the
particular value of the estimator generated from the sample taken is called an estimate.


You need to be able to determine how reliable these sample statistics are as estimators for the
corresponding population parameters.


Since all the $X_i$ are random variables having the same mean and variance as the population, you can
sometimes find the expected value of a statistic T, E(T), and this will tell you what the ‘average’ value of
the statistic should be.


Example 


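A random sample $X_1, X_2, \ldots, X_n$ is taken from a population with $X \sim N(\mu, \sigma^2)$.
Show that $E(\bar{X}) = \mu$.

\[
\begin{aligned}
\bar{X} &= \frac{1}{n}(X_1 + \cdots + X_n) \\
E(\bar{X}) &= \frac{1}{n}E(X_1 + \cdots + X_n) && \text{Use } E(aX) = aE(X). \\
&= \frac{1}{n}\bigl(E(X_1) + \cdots + E(X_n)\bigr) && \text{Use } E(X + Y) = E(X) + E(Y). \\
&= \frac{1}{n}(\mu + \cdots + \mu) \\
&= \frac{n\mu}{n} \\
E(\bar{X}) &= \mu
\end{aligned}
\]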


This example shows that if you use the sample mean as an estimator of the population mean then ‘on 
average’ it will give the correct value. 


This is an important property for an estimator to have. You say that $\bar{X}$ is an unbiased estimator of $\mu$.
A specific value of $\bar{x}$ will be an unbiased estimate for $\mu$.


■ If a statistic $T$ is used as an estimator for a population parameter $\theta$ and $E(T) = \theta$, then $T$ is
an unbiased estimator for $\theta$.


When selecting suitable estimators for population parameters, bias is one consideration. In Example 3,
you found two statistics based on samples of size 3 from a population of counters of which 60% had
the number 0 on them and 40% had the number 1. The population mean $\mu$ was $\frac{2}{5}$ and the population
mode was 0 (since 60% of the counters had 0 on them). The two statistics that you calculated were
the sample mean $\bar{X}$ and the sample mode $M$. You could use either of them as estimators for $\mu$, the
population mean, but you saw that $E(\bar{X}) = \mu$ and $E(M) \neq \mu$. So in this case, if you wanted an unbiased
estimator for $\mu$, you would choose the sample mean $\bar{X}$ rather than the sample mode $M$. How about
an estimator for the population mode? Neither of the statistics that you calculated had the property
of being unbiased since $E(\bar{X}) = \mu = \frac{2}{5}$ and $E(M) = \frac{44}{125}$ whereas the population mode was 0.
Intuitively, you might prefer the estimator $M$ since it is, after all, a mode and is also slightly closer
to the population mode. In this case you refer to $M$ as a biased estimator for the population mode.
The bias is simply the expected value of the estimator minus the parameter of the population it is
estimating.


■ If a statistic $T$ is used as an estimator for a population parameter $\theta$ then the bias $= E(T) - \theta$.
In this case the bias is $\frac{44}{125}$.
■ For an unbiased estimator, the bias is 0.


You saw in Example 4 that the mean of a sample is an unbiased estimator for the population mean.
If you take a sample $X_1$ of size one from a population with mean $\mu$ and variance $\sigma^2$, then the sample
mean is $\bar{X} = X_1$, because there is only one value. So $E(\bar{X}) = E(X_1) = \mu$.


If you wanted to find an estimator for the population variance, you might try using the
variance of the sample, $V = \dfrac{\sum (X_i - \bar{X})^2}{n}$.

For our sample $X_1$ of size one, the variance of the sample will be
$\dfrac{(X_1 - \bar{X})^2}{1} = (X_1 - X_1)^2 = 0$.

So for a sample of size one, $E(V) = 0 \neq \sigma^2$. This illustrates that the variance of the sample is not an
unbiased estimator for the variance of the population.

Note: In general, the variance of the sample will be an underestimate for the variance of the
population. This is because the statistic $\dfrac{\sum (X_i - \bar{X})^2}{n}$ uses the sample mean $\bar{X}$ rather
than the population mean $\mu$, and on average the sample observations will be closer to $\bar{X}$ than to $\mu$.


You can use a slightly different statistic, called the sample variance, as an unbiased estimator for
the population variance.


■ An unbiased estimator for $\sigma^2$ is given by the sample variance $S^2$ where

\[ S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 \]

Notation: $S^2$ is the estimator (a random variable), and $s^2$ is the estimate (an
observation from this random variable).

There are several ways to calculate the value of $s^2$ for a particular sample:

\[
\begin{aligned}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \\
&= \frac{S_{xx}}{n-1} \\
&= \frac{n}{n-1}\left(\frac{\sum x^2}{n} - \bar{x}^2\right) \\
&= \frac{1}{n-1}\left(\sum x^2 - n\bar{x}^2\right)
\end{aligned}
\]

Links: You can use the equivalence of these forms to show that $s^2$ is an unbiased
estimate for $\sigma^2$. → Exercise 5A Challenge


The form that you use will depend on the information that you are given in the question. 
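To see why the divisor $n - 1$ matters, the expected values of both statistics can be computed exactly for a small population. The sketch below is an illustration under stated assumptions: it reuses the counters population from Example 3 (60% zeros, 40% ones, so $\sigma^2 = \frac{6}{25}$) with samples of size 3, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumed population for illustration: the counters from Example 3.
population = {0: Fraction(3, 5), 1: Fraction(2, 5)}
n = 3


def expected(statistic):
    """Exact expected value of `statistic` over all ordered samples of size n."""
    total = Fraction(0)
    for sample in product(population, repeat=n):
        prob = Fraction(1)
        for value in sample:
            prob *= population[value]
        total += prob * statistic(sample)
    return total


def variance_of_sample(sample):
    """V: sum of squared deviations divided by n (biased)."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / len(sample)


def sample_variance(sample):
    """S^2: sum of squared deviations divided by n - 1 (unbiased)."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)


e_v = expected(variance_of_sample)
e_s2 = expected(sample_variance)
print(e_v, e_s2)  # compare each with the population variance 6/25
```

Here $E(V)$ comes out as $\frac{n-1}{n}\sigma^2$, an underestimate, while $E(S^2)$ equals $\sigma^2$ exactly.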


Although a sample of size one can be used as an unbiased estimator of ju, it is clear that, in practice, 
a single observation from a population will not provide a useful estimate of the population mean. 
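You need some way of differentiating between the quality of different unbiased estimators.

A random sample $X_1, X_2, \ldots, X_n$ is taken from a population with $X \sim N(\mu, \sigma^2)$.

Show that $\mathrm{Var}(\bar{X}) = \dfrac{\sigma^2}{n}$.

\[
\begin{aligned}
\bar{X} &= \frac{1}{n}(X_1 + \cdots + X_n) \\
\mathrm{Var}(\bar{X}) &= \frac{1}{n^2}\mathrm{Var}(X_1 + \cdots + X_n) && \text{Use } \mathrm{Var}(aX) = a^2\,\mathrm{Var}(X). \\
&= \frac{1}{n^2}\bigl(\mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n)\bigr) && \text{Use } \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \text{ for independent } X, Y. \\
&= \frac{1}{n^2}(\sigma^2 + \cdots + \sigma^2) \\
&= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
\end{aligned}
\]

One reason that the sample mean is used as an estimator for $\mu$ is that the variance of the estimator,
$\mathrm{Var}(\bar{X}) = \dfrac{\sigma^2}{n}$, decreases as $n$ increases. For larger values of $n$, the value of an estimate is more likely to
be close to the population mean. So the greater the value of $n$, the better the estimator is.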


■ The standard deviation of an estimator is called the standard error of the estimator.


When you are using the sample mean, $\bar{X}$, you can use the following result for the standard error.

■ Standard error of $\bar{X} = \dfrac{\sigma}{\sqrt{n}}$ or $\dfrac{s}{\sqrt{n}}$

Watch out: Use the second version of this standard error in situations where you do not know the
population standard deviation.
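The standard error formula also tells you how large a sample must be to reach a target precision: $\frac{\sigma}{\sqrt{n}} < k$ rearranges to $n > \left(\frac{\sigma}{k}\right)^2$. The sketch below illustrates this with hypothetical values ($\sigma = 4$, target 0.5) that do not come from the text; the function names are also assumptions.

```python
from math import floor, sqrt


def standard_error(sd, n):
    """Standard error of the sample mean: sd / sqrt(n)."""
    return sd / sqrt(n)


def smallest_n_for_standard_error(sd, target):
    """Smallest integer n with sd / sqrt(n) strictly less than target."""
    return floor((sd / target) ** 2) + 1


# Hypothetical values chosen for illustration only.
se = standard_error(4.0, 100)
n_needed = smallest_n_for_standard_error(4.0, 0.5)
print(se, n_needed)
```

Because the inequality is strict, a sample size that gives a standard error exactly equal to the target is not enough, which is why the function adds 1 after rounding down.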


Example 


The table below summarises the number of breakdowns, X, on a town’s bypass on 30 randomly 
chosen days. 


Number of breakdowns    2   3   4   5   6   7   8   9
Number of days          3   5   4   3   5   4   4   2


a Calculate unbiased estimates of the mean and variance of the number of breakdowns.
20 more days were randomly sampled and this sample had $\bar{x} = 6.0$ and $s^2 = 5.0$.
b Treating the 50 results as a single sample, obtain further unbiased estimates of the population
mean and variance.
c Find the standard error of this new estimate of the mean.
d Estimate the size of sample required to achieve a standard error of less than 0.25.

Notation: A 'hat' (^) is used to denote an estimate of a parameter. For example:
$\hat{\sigma}^2$ represents an estimate for the population variance $\sigma^2$.
$\hat{\mu}$ represents an estimate for the population mean $\mu$.

a By calculator, $\sum x = 160$ and $\sum x^2 = 990$.

So $\hat{\mu} = \bar{x} = \dfrac{160}{30} = 5.33$ (3 s.f.)

and $\hat{\sigma}^2 = s_x^2 = \dfrac{990 - 30\bar{x}^2}{29} = 4.7126\ldots = 4.71$ (3 s.f.)

b Problem-solving: First you need to 'unwrap' the formulae for $\bar{y}$ and $s_y^2$ to find $\sum y$ and $\sum y^2$.

New sample: $\bar{y} = 6.0 \Rightarrow \sum y = 20 \times 6.0 = 120$

$s_y^2 = 5.0 \Rightarrow \dfrac{\sum y^2 - 20 \times 6.0^2}{19} = 5.0$

So $\sum y^2 = 5 \times 19 + 20 \times 36 \Rightarrow \sum y^2 = 815$

So the combined sample ($w$) of size 50 has

$\sum w = 160 + 120 = 280$
$\sum w^2 = 990 + 815 = 1805$

Then the combined estimate of $\mu$ is

$\bar{w} = \dfrac{280}{50} = 5.6$

and the estimate for $\sigma^2$ is

$s_w^2 = \dfrac{1805 - 50 \times 5.6^2}{49} \Rightarrow s_w^2 = 4.8367\ldots = 4.84$ (3 s.f.)

The best estimate of $\sigma^2$ will be $s_w^2$ since it is based on a larger sample than $s_x^2$ or $s_y^2$.

c So the standard error is $\dfrac{s_w}{\sqrt{50}} = \sqrt{\dfrac{4.836\ldots}{50}} = 0.311$ (3 s.f.)

d To achieve a standard error < 0.25 you require

$\dfrac{s_w}{\sqrt{n}} < 0.25$

$\Rightarrow \sqrt{n} > \dfrac{\sqrt{4.836\ldots}}{0.25}$

$\Rightarrow \sqrt{n} > 8.797\ldots$

$\Rightarrow n > 77.38\ldots$

So we need a sample size of at least 78.
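The arithmetic in this example, including the step of 'unwrapping' the summary statistics of the second sample, can be reproduced in a few lines. This is a minimal sketch; the function names `unbiased_estimates` and `sums_from_estimates` are assumptions made for illustration.

```python
from math import floor, sqrt


def unbiased_estimates(n, sum_x, sum_x2):
    """Return (mean, s^2) from n, the sum of x and the sum of x squared."""
    mean = sum_x / n
    s2 = (sum_x2 - n * mean ** 2) / (n - 1)
    return mean, s2


def sums_from_estimates(n, mean, s2):
    """Invert the formulae above to recover the sum of x and the sum of x squared."""
    sum_x = n * mean
    sum_x2 = (n - 1) * s2 + n * mean ** 2
    return sum_x, sum_x2


# Part a: the first 30 days.
mean_x, s2_x = unbiased_estimates(30, 160, 990)

# Part b: unwrap the second sample, then pool the sums.
sum_y, sum_y2 = sums_from_estimates(20, 6.0, 5.0)
mean_w, s2_w = unbiased_estimates(50, 160 + sum_y, 990 + sum_y2)

# Part c: standard error of the combined estimate of the mean.
standard_error = sqrt(s2_w / 50)

# Part d: smallest n with sqrt(s2_w / n) strictly below 0.25.
required_n = floor(s2_w / 0.25 ** 2) + 1

print(round(mean_x, 2), round(s2_x, 2))
print(mean_w, round(s2_w, 2), round(standard_error, 3), required_n)
```

Pooling the raw sums, rather than averaging the two values of $s^2$, is what makes the combined estimate of the variance unbiased for the single sample of 50.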
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Exercise 5A) 


1 The lengths of nails produced by a certain machine are normally distributed with a mean $\mu$
and standard deviation $\sigma$. A random sample of 10 nails is taken and their lengths
$\{X_1, X_2, X_3, \ldots, X_{10}\}$ are measured.
i Write down the distributions of the following: 
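a $\sum_{i=1}^{10} X_i$    b $\dfrac{2X_1 + 3X_{10}}{5}$    c $\sum_{i=1}^{10} (X_i - \mu)$
d $\bar{X}$    e $\sum_{i=1}^{5} X_i - \sum_{i=6}^{10} X_i$    f $\sum_{i=1}^{10} \left(\dfrac{X_i - \mu}{\sigma}\right)$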

ii State which of the above are statistics. 


2 A large bag of coins contains 1p, 5p and 10p coins in the ratio 2 : 2 : 1.

a Find the mean $\mu$ and the variance $\sigma^2$ for the value of coins in this population.
A random sample of two coins is taken and their values $X_1$ and $X_2$ are recorded.
b List all the possible observations from this sample.

c Find the sampling distribution for the mean $\bar{X} = \dfrac{X_1 + X_2}{2}$.

d Hence show that $E(\bar{X}) = \mu$ and $\mathrm{Var}(\bar{X}) = \dfrac{\sigma^2}{2}$.


3 Find unbiased estimates of the mean and variance of the populations from which the following
random samples have been taken.


a 21.3; 19.6; 18.5; 22.3; 17.4; 16.3; 18.9; 17.6; 18.7; 16.5; 19.3; 21.8; 20.1; 22.0
b 1; 2; 5; 1; 6; 4; 1; 3; 2; 8; 5; 6; 2; 4; 3; 1

c 120.4; 230.6; 356.1; 129.8; 185.6; 147.6; 258.3; 329.7; 249.3

d 0.862; 0.754; 0.459; 0.473; 0.493; 0.681; 0.743; 0.469; 0.538; 0.361


4 Find unbiased estimates of the mean and the variance of the populations from which random
samples with the following summaries have been taken.


a $n = 120$, $\sum x = 4368$, $\sum x^2 = 162\,466$
b $n = 30$, $\sum x = 270$, $\sum x^2 = 2546$
c $n = 1037$, $\sum x = 1140.7$, $\sum x^2 = 1278.08$
d $n = 15$, $\sum x = 168$, $\sum x^2 = 1913$


5 The concentrations, in mg per litre, of a trace element in 7 randomly chosen samples of water
from a spring were:


240.8 237.3 236.7 236.6 234.2 233.9 232.5
a Explain what is meant by an unbiased estimator. (1 mark)


b Determine unbiased estimates of the mean and the variance of the concentration
of the trace element per litre of water from the spring. (4 marks)

6 Cartons of orange juice are filled by a machine. A sample of 10 cartons selected at random
from the production contained the following quantities of orange juice (in ml).


201.2 205.0 209.1 202.3 204.6 206.4 210.1 201.9 203.7 207.3


Calculate unbiased estimates of the mean and variance of the population from which
this sample was taken. (4 marks)


7 A manufacturer of self-assembly furniture required bolts of two lengths, 5 cm and 10 cm, in the
ratio 2 : 1 respectively.


a Find the mean $\mu$ and the variance $\sigma^2$ for the lengths of bolts in this population.


A random sample of three bolts is selected from a large box containing bolts in the required
ratio.


b List all the possible observations from this sample.

c Find the sampling distribution for the mean $\bar{X}$.

d Hence find $E(\bar{X})$ and $\mathrm{Var}(\bar{X})$.

e Find the sampling distribution for the mode $M$.

f Hence find $E(M)$ and $\mathrm{Var}(M)$.

g Find the bias when $M$ is used as an estimator of the population mode.


8 A biased six-sided dice has probability $p$ of landing on a six.


Every day, for a period of 25 days, the dice is rolled 10 times and the number of sixes $X$ is
recorded, giving rise to a sample $X_1, X_2, \ldots, X_{25}$.


a Write down $E(X)$ in terms of $p$.
b Show that the sample mean $\bar{X}$ is a biased estimator of $p$ and find the bias.


c Suggest a suitable unbiased estimator of $p$.


9 The random variable $X \sim U[-a, a]$.
a Find $E(X)$ and $E(X^2)$.
A random sample $X_1, X_2, X_3$ is taken and the statistic $Y = X_1^2 + X_2^2 + X_3^2$ is calculated.


b Show that $Y$ is an unbiased estimator of $a^2$.


10 John and Mary each independently took a random sample of sixth-formers in their college and
asked them how much money, in pounds, they earned last week. John used his sample of size 20
to obtain unbiased estimates of the mean and variance of the amount earned by a sixth-former
at their college last week. He obtained values of $\bar{x} = 15.5$ and $s^2 = 8.0$.


Mary's sample of size 30 can be summarised as $\sum y = 486$ and $\sum y^2 = 8222$.


a Use Mary's sample to find unbiased estimates of $\mu$ and $\sigma^2$. (2 marks)
b Combine the samples and use all 50 observations to obtain further unbiased
estimates of $\mu$ and $\sigma^2$. (4 marks)
c Explain what is meant by standard error. (1 mark)
d Find the standard error of the mean for each of these estimates of $\mu$. (2 marks)
e Comment on which estimate of $\mu$ you would prefer to use. (1 mark)




(E/P) 11 A machine operator checks a random sample of 20 bottles from a production line in order to
estimate the mean volume of bottles (in cm³) from this production run. The 20 values can be
summarised as $\sum x = 1300$ and $\sum x^2 = 84\,685$.
a Use this sample to find unbiased estimates of $\mu$ and $\sigma^2$. (2 marks)


A supervisor knows from experience that the standard deviation of volumes on this process, $\sigma$,
should be 3 cm³ and he wishes to have an estimate of $\mu$ that has a standard error of less than
0.5 cm³.


b Recommend a sample size for the supervisor, showing working to support your
recommendation. (2 marks)


c Does your recommended sample size guarantee a standard error of less than 0.5 cm³?
Give a reason for your answer. (1 mark)


The supervisor takes a further sample of size 16 and finds $\sum x = 1060$.
d Combine the two samples to obtain a revised estimate of $\mu$. (2 marks)


(E) 12 The heights of certain seedlings after growing for 10 weeks in a greenhouse have a 
standard deviation of 2.6cm. Find the smallest sample that must be taken for the 
standard error of the mean to be less than 0.5 cm. (3 marks) 


(E) 13 The hardness of a plastic compound was determined by measuring the indentation produced 
by a heavy pointed device. 


The following observations in tenths of a millimetre were obtained: 

4.7, 5.2, 5.4, 4.8, 4.5, 4.9, 4.5, 5.1, 5.0, 4.8. 
a Estimate the mean indentation for this compound. (1 mark) 
b Find the standard error for your estimate. (2 marks) 


c Estimate the size of sample required in order that in future the standard error of 
the mean should be just less than 0.05. (3 marks) 


(P) 14 Prospective army recruits receive a medical test. The probability of each recruit passing the test
is $p$, independent of any other recruit. The medicals are carried out over two days and on the
first day $n$ recruits are seen and on the next day $2n$ are seen. Let $X_1$ be the number of recruits
who pass the test on the first day and let $X_2$ be the number who pass on the second day.


a Write down $E(X_1)$, $E(X_2)$, $\mathrm{Var}(X_1)$ and $\mathrm{Var}(X_2)$.


b Show that $\dfrac{X_1}{n}$ and $\dfrac{X_2}{2n}$ are both unbiased estimates of $p$ and state, giving a reason, which you
would prefer to use.


c Show that $X = \dfrac{1}{2}\left(\dfrac{X_1}{n} + \dfrac{X_2}{2n}\right)$ is an unbiased estimator of $p$.


d Show that $Y = \dfrac{X_1 + X_2}{3n}$ is an unbiased estimator of $p$.


3n 


e Which of the statistics $\dfrac{X_1}{n}$, $\dfrac{X_2}{2n}$, $X$ or $Y$ is the best estimator of $p$?


The statistic $T = \dfrac{2X_1 + X_2}{3n}$ is proposed as an estimator of $p$.


f Find the bias.

15 A large bag of counters has 40% with the number 0 on, 40% with the number 2 and 20%
with the number 1.

a Find the mean $\mu$ and the variance $\sigma^2$ for this population of counters.
A random sample of size 3 is taken from the bag.

b List all the possible observations from this sample.

c Find the sampling distribution for the mean $\bar{X}$.

d Find $E(\bar{X})$ and $\mathrm{Var}(\bar{X})$.

e Find the sampling distribution for the median $N$.

f Hence find $E(N)$ and $\mathrm{Var}(N)$.

g Show that $N$ is an unbiased estimator of $\mu$.

h Explain which estimator, $\bar{X}$ or $N$, you would choose as an estimator of $\mu$.


Challenge
a Show that $\dfrac{1}{n-1}\sum (X_i - \bar{X})^2 = \dfrac{1}{n-1}\left(\sum X_i^2 - n\bar{X}^2\right)$.


b Hence, or otherwise, show that $s^2$ is an unbiased estimate for
the population variance $\sigma^2$.


Confidence intervals


The value of $\hat{\theta}$ which is found from a sample and used as an unbiased estimate for the population
parameter $\theta$ is very unlikely to be exactly equal to $\theta$.


There is no way of establishing, from the sample data only, how close the estimate is.
You can instead form a confidence interval for $\theta$.


■ A confidence interval (C.I.) for a population parameter $\theta$ is a range of values defined so that
there is a specific probability that the true value of the parameter lies within that range.


For example, you could establish a 90% confidence interval, or a 95% confidence interval.


A 95% confidence interval is an interval such that there is a 0.95 probability that the interval contains $\theta$.

Watch out: The population parameter is fixed, so you cannot describe its value in probabilistic terms.

Different samples will generate different confidence intervals since estimates for the parameter will change
based on the data in the sample and the sample size.


For large $n$, you know from the central limit theorem that whatever the distribution of the random variable $X$, the
sample mean will be approximately normally distributed:

\[ \bar{X} \approx\sim N\left(\mu, \frac{\sigma^2}{n}\right) \]

Links: This approximation improves for larger samples and is exact if the
population is normally distributed. ← FS1, Chapter 5
Hence, if you know the population standard deviation, you can establish a confidence interval for the
population mean, $\mu$, using the standardised normal distribution.

Show that a 95% confidence interval for $\mu$, based on a sample of size $n$, is given by

\[ \left(\bar{x} - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96 \times \frac{\sigma}{\sqrt{n}}\right) \]

Problem-solving: You will need to use the standardised normal distribution $N(0, 1^2)$ to tackle problems like this.

$\bar{X}$ is approximately $\sim N\left(\mu, \dfrac{\sigma^2}{n}\right)$ and therefore

\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1^2) \]

If $X \sim N(\mu, \sigma^2)$ then $Z = \dfrac{X - \mu}{\sigma} \sim N(0, 1^2)$. ← SM2, Chapter 3

Whatever the distribution of the population, you know by the central limit theorem that $\bar{X}$ will be
approximately normal.

Using tables, or your calculator, you can see that for the $N(0, 1^2)$ distribution
$P(Z > 1.9600) = P(Z < -1.9600) = 0.025$
and so 95% of the distribution is between $-1.9600$ and $1.9600$.

[Figure: standard normal density f(z) with the central 95% region between z = -1.96 and z = 1.96.]

So $P(-1.96 < Z < 1.96) = 0.95$

\[ \Rightarrow P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < 1.96\right) = 0.95 \]

Look at the inequality inside the probability statement:

\[
\begin{aligned}
-1.96 \times \frac{\sigma}{\sqrt{n}} &< \bar{X} - \mu < 1.96 \times \frac{\sigma}{\sqrt{n}} \\
-\bar{X} - 1.96 \times \frac{\sigma}{\sqrt{n}} &< -\mu < -\bar{X} + 1.96 \times \frac{\sigma}{\sqrt{n}} && \text{Start to isolate } \mu. \\
\bar{X} + 1.96 \times \frac{\sigma}{\sqrt{n}} &> \mu > \bar{X} - 1.96 \times \frac{\sigma}{\sqrt{n}} && \text{Multiply by } -1 \text{ and change the inequalities.} \\
\bar{X} - 1.96 \times \frac{\sigma}{\sqrt{n}} &< \mu < \bar{X} + 1.96 \times \frac{\sigma}{\sqrt{n}}
\end{aligned}
\]

So the 95% confidence interval for $\mu$ is

\[ \left(\bar{x} - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96 \times \frac{\sigma}{\sqrt{n}}\right) \]

Notation: The upper and lower values of a confidence interval are sometimes called the
confidence limits.

■ The 95% confidence interval for $\mu$ is $\left(\bar{x} - 1.96 \times \dfrac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96 \times \dfrac{\sigma}{\sqrt{n}}\right)$.

The 1.96 in the formula above is determined by the percentage points of the standardised normal
distribution. By changing this value you can formulate confidence intervals with different levels of
confidence.
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For example, a 99% confidence interval would 
see 1.96 replaced by 2.5758, since that is the 
value of z such that P(-z < Z < z) =0.99. 


Note | The choice of what level of confidence to 
use in a particular situation will depend on the 
problem involved but a value of 95% is commonly 
used if no other value is specified. 


Interpreting confidence intervals 


• First, it is important to remember that $\mu$ is a fixed, but unknown, number and as such it cannot vary
and does not have a distribution.


• Secondly, it is worth remembering that you base a 95% confidence interval on a probability
statement about the normal distribution $Z \sim N(0, 1^2)$.


• However, although you start by considering probabilities associated with the random variable $Z$,
the final confidence interval does not tell you the probability that $\mu$ lies inside a fixed interval.
Rather, since $\mu$ is fixed, it is the confidence interval that varies (according to the value of $\bar{x}$).


• What a 95% confidence interval tells you is that the probability that the interval contains $\mu$ is 0.95.


[Figure: 95% confidence intervals calculated from different samples, plotted against the position of $\mu$; one interval, marked *, does not contain $\mu$.]

The diagram opposite illustrates the 95% confidence intervals
calculated from different samples and also shows the position
of $\mu$. Suppose 20 samples of size 100 were taken and 95%
confidence intervals for $\mu$ were calculated for each sample. This
would give 20 different confidence intervals, each based on one
of the 20 different values of $\bar{x}$. If you imagine for a moment
that you actually do know what the value of $\mu$ is then you can
plot each of these confidence intervals on a diagram similar to
the one here; you would expect that 95% of these confidence
intervals would contain the value $\mu$ but about once in every
20 times you would get an interval which did not contain $\mu$ (like the one marked * here). The problem
for the statistician is that they never know whether the confidence interval they have just calculated
is one that contains $\mu$ or not.


However, 95% (or 90% or 99%, depending on the degree of confidence required) of the time the
interval will contain $\mu$.
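The 'repeated sampling' interpretation can be explored with a short simulation. The sketch below is illustrative only: the population values ($\mu = 50$, $\sigma = 10$), the number of trials and the function name `coverage` are all assumptions, and a fixed seed is used so that the run is repeatable.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(mu, sigma, n, trials, level=0.95, seed=0):
    """Proportion of simulated confidence intervals that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and form the interval around its mean.
        x_bar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if x_bar - margin < mu < x_bar + margin:
            hits += 1
    return hits / trials


# Hypothetical population chosen for illustration only.
proportion = coverage(mu=50.0, sigma=10.0, n=100, trials=2000)
print(proportion)  # expected to be close to 0.95
```

Each simulated interval either contains $\mu$ or it does not; it is the long-run proportion of intervals containing $\mu$ that is close to 0.95.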


The breaking strains of reels of string produced at a certain factory have a standard deviation of 
1.5kg. A sample of 100 reels from a certain batch were tested and their mean breaking strain was 
5.30kg. 


a Find a 95% confidence interval for the mean breaking strain of string in this batch. 


The manufacturer becomes concerned if the lower 95% confidence limit falls below 5 kg. 
A sample of 80 reels from another batch gave a mean breaking strain of 5.31 kg. 


b Will the manufacturer be concerned?

The distribution for breaking strains is not known but the sample is quite large and by the central limit
theorem $\bar{X}$ will be approximately normally distributed.

a 95% confidence limits are

\[ \bar{x} \pm 1.96 \times \frac{\sigma}{\sqrt{n}} = 5.30 \pm 1.96 \times \frac{1.5}{\sqrt{100}} \]

(Use the $\bar{x} \pm 1.96 \times \dfrac{\sigma}{\sqrt{n}}$ formula.)

So a 95% confidence interval is (5.006, 5.594).

In your exam you can define a confidence interval
• by giving the confidence limits, e.g. 5.30 ± 0.294
• using interval notation, e.g. (5.006, 5.594) or [5.006, 5.594]
• using inequalities, e.g. $5.006 < \mu < 5.594$.

b Lower 95% confidence limit is

\[ \bar{x} - 1.96 \times \frac{\sigma}{\sqrt{n}} = 5.31 - 1.96 \times \frac{1.5}{\sqrt{80}} = 4.98 \]

so the manufacturer will be concerned.
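The calculation in this example can be written as a small reusable function. This is a minimal sketch, assuming a known population standard deviation; the function name `z_confidence_interval` is hypothetical, and the percentage point is taken from Python's standard-library `statistics.NormalDist` rather than from printed tables.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, level=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # percentage point, about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Part a: 100 reels with mean breaking strain 5.30 kg and sigma = 1.5 kg.
low_a, high_a = z_confidence_interval(5.30, 1.5, 100)

# Part b: 80 reels with mean breaking strain 5.31 kg.
low_b, high_b = z_confidence_interval(5.31, 1.5, 80)
concerned = low_b < 5

print(round(low_a, 3), round(high_a, 3))
print(round(low_b, 2), concerned)
```

Passing `level=0.99` or `level=0.90` changes only the percentage point, which is the same adjustment you make when reading a different value from the tables.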


■ The width of a confidence interval is the difference between the upper confidence limit and the
lower confidence limit. This is $2 \times z \times \dfrac{\sigma}{\sqrt{n}}$ where $z$ is the relevant percentage point from
the standardised normal distribution, for example 1.96, 1.6449, etc.


The greater the width, the less information you have about the population mean. There are three
factors that affect the width: the value of $\sigma$, the size of the sample and the degree of confidence
required. In a particular example where $\sigma$ and $n$ are determined, the only factor you can change to
alter the width is the degree of confidence. A high level of confidence (e.g. 99%) will give a greater
width than a lower level of confidence (e.g. 90%), and the statistician has to weigh up the advantages
of high confidence against greater width when calculating a confidence interval.


Example 


A random sample of size 25 is taken from a normal population with standard deviation 2.5. 
The mean of the sample is 17.8. 


a Find a 99% C.I. for the population mean $\mu$.
b What size sample is required to obtain a 99% C.I. of width of at most 1.5?


c What confidence level would be associated with the interval based on the above sample of 25 but
of width 1.5, i.e. (17.05, 18.55)?


a 99% confidence limits are

\[ \bar{x} \pm 2.5758 \times \frac{\sigma}{\sqrt{n}} = 17.8 \pm 2.5758 \times \frac{2.5}{\sqrt{25}} \]

(Use the table on page 214 to find 2.5758.)

So a 99% confidence interval is (16.51, 19.09).


b Width of 99% C.I. is $2 \times 2.5758 \times \dfrac{2.5}{\sqrt{n}}$

(Use the $2 \times z \times \dfrac{\sigma}{\sqrt{n}}$ formula or the definition for the width.)

so you require $1.5 \geq \dfrac{12.879}{\sqrt{n}}$

i.e. $\sqrt{n} \geq 8.586$, so $n \geq 73.71\ldots$

So you need $n = 74$.

c A width of 1.5 $\Rightarrow 1.5 = 2 \times z \times \dfrac{2.5}{\sqrt{25}}$

so $z = 1.5$

From the table on page 214 you find that
$P(Z < 1.5) = 0.9332$

and so $P(Z > 1.5) = P(Z < -1.5) = 1 - 0.9332 = 0.0668$

[Figure: standard normal density f(z) with tail areas of 0.0668 below z = -1.5 and above z = 1.5.]

So the confidence level is
$100 \times (1 - 2 \times 0.0668) = 86.6\%$.

The percentage of the confidence interval is given
by the area between $z = \pm 1.5$.
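Parts b and c of this example both come from rearranging the width formula $2 \times z \times \frac{\sigma}{\sqrt{n}}$, once for $n$ and once for $z$. The sketch below shows both rearrangements; the function names are assumptions introduced for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_sample_size(sigma, level, max_width):
    """Smallest n for which the interval width 2*z*sigma/sqrt(n) is at most max_width."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return ceil((2 * z * sigma / max_width) ** 2)


def confidence_level(sigma, n, width):
    """Confidence level of an interval of the given width around the sample mean."""
    z = width * sqrt(n) / (2 * sigma)
    return 2 * NormalDist().cdf(z) - 1


n_required = min_sample_size(sigma=2.5, level=0.99, max_width=1.5)
level_c = confidence_level(sigma=2.5, n=25, width=1.5)

print(n_required)             # part b
print(round(100 * level_c, 1))  # part c, as a percentage
```

Rounding up with `ceil` reflects the fact that a sample size must be a whole number and the width requirement is 'at most'.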


Exercise 5B


1 A random sample of size 9 is taken from a normal distribution with variance 36.
The sample mean is 128.


a Find a 95% confidence interval for the mean $\mu$ of the distribution.
b Find a 99% confidence interval for the mean $\mu$ of the distribution.

2 A random sample of size 25 is taken from a normal distribution with standard deviation 4.
The sample mean is 85.


a Find a 90% confidence interval for the mean $\mu$ of the distribution.
b Find a 95% confidence interval for the mean $\mu$ of the distribution.


3 A random sample is taken from a distribution with mean $\mu$ and variance 4.41. The sample has
the following values:


23.4, 21,8, 240, 225, 


Niall says that even though the sample is small, he can still use the normal distribution to
obtain confidence limits for $\mu$.


a State what assumption Niall must make in order for this to be true. (1 mark)


b Given that this assumption is true, use the sample to find 98% confidence limits for
the mean $\mu$. (3 marks)


4 A normal distribution has standard deviation 15. Estimate the sample size required if the
following confidence intervals for the mean should have width of less than 2. 


a 90% b 95% c 99% 


5 An experienced poultry farmer knows that the mean mass $\mu$ kg for a large population of
chickens will vary from season to season but the standard deviation of the masses should
remain at 0.70 kg. A random sample of 100 chickens is taken from the population and the mass
$x$ kg of each chicken in the sample is recorded, giving $\sum x = 190.2$.


a Explain what is meant by a 95% confidence interval for a population parameter $\theta$. (1 mark)
b Find a 95% confidence interval for $\mu$. (3 marks)


6 A railway watchdog is studying the number of seconds that express trains are late to arrive.
Previous surveys have shown that the standard deviation is 50. A random sample of 200 trains
was selected and gave rise to a mean of 310 seconds late.
a Find a 90% confidence interval for $\mu$, the mean number of seconds that express
trains are late. (3 marks)
Five different independent random samples of 200 trains are selected, and each sample is used
to generate a different 90% confidence interval for $\mu$.


b Find the probability that exactly three of these
confidence intervals contain $\mu$. (2 marks)

Hint: Use a suitable binomial distribution.


7 Amy is investigating the total distance travelled by lorries in current use. The standard deviation
can be assumed to be 15 000 km. A random sample of 80 lorries was stopped and their mean
distance travelled was found to be 75 872 km.


Amy suspects that the population is not normally distributed, but claims that she can still use
the normal distribution to find a confidence interval for $\mu$.
a State, with a reason, whether Amy is correct. (2 marks)


b Find a 90% confidence interval for the mean distance travelled by lorries in
current use. (3 marks)


8 It is known that each year the standard deviation of the marks in a certain examination is 13.5
but the mean mark $\mu$ will fluctuate. An examiner wants to estimate the mean mark of all the
candidates on the examination but he only has the marks of a sample of 250 candidates, which
gives a sample mean of 68.4.


a What assumption about the candidates must the examiner make in order to use this
sample mean to calculate a confidence interval for $\mu$? (1 mark)
b Assuming that the above assumption is justified, calculate a 95% confidence interval
for $\mu$. (3 marks)
Later, the examiner discovers that the actual value of $\mu$ was 65.3.
c What conclusions might the examiner draw about his sample? (2 marks)


9 The battery life in hours of a new model of mobile phone in standby mode is modelled as
having a uniform distribution on $[\mu - 10, \mu + 10]$, where the value of $\mu$ is not known.


a Show that the variance of the battery life is $\dfrac{100}{3}$. (3 marks)


A random sample of 120 phones was tested and the mean battery life on standby was
78.7 hours.


b Find a 95% confidence interval for $\mu$. (3 marks)

10 A statistics student calculated 95% and 99% confidence intervals for the mean $\mu$ of a certain
population but failed to label them. The two intervals were (22.7, 27.3) and (23.2, 26.8).


a State, with a reason, which interval is the 95% one. (1 mark)
b Estimate the standard error of the mean in this case. (2 marks)
c What was the student's unbiased estimate of the mean $\mu$ in this case? (2 marks)


11 The managing director of a firm has commissioned a survey to estimate the mean expenditure
of customers on electrical appliances. A random sample of 100 people were questioned
and the research team presented the managing director with a 95% confidence interval of
(£128.14, £141.86).


The director says that this interval is too wide and wants a confidence interval of total width £10.
a Using the same value of $\bar{x}$, find the confidence limits in this case. (3 marks)
b Find the level of confidence for the interval in part a. (2 marks)


The managing director is still not happy and now wishes to know how large a sample would be 
required to obtain a 95% confidence interval of total width no greater than £10. 


c Find the smallest size of sample that will satisfy this request. (3 marks) 


12 A plant produces steel sheets whose masses are known to be normally distributed with a
standard deviation of 2.4 kg. A random sample of 36 sheets had a mean mass of 31.4 kg.
Find 99% confidence limits for the population mean. (3 marks)


13 A machine is regulated to dispense liquid into cartons in such a way that the amount of liquid
dispensed on each occasion is normally distributed with a standard deviation of 20 ml.


Find 99% confidence limits for the mean amount of liquid dispensed if a random sample of
40 cartons had an average content of 266 ml. (3 marks)


14 a The error made when a certain instrument is used to measure the body length of a butterfly
of a particular species is known to be normally distributed with mean 0 and standard
deviation 1 mm. Calculate, to 3 decimal places, the probability that the size of the
error made when the instrument is used once is less than 0.4 mm. (2 marks)

b Given that the body length of a butterfly is measured 9 times with the instrument, 
calculate, to 3 decimal places, the probability that the mean of the 9 readings will be 
within 0.5mm of the true length. (3 marks) 

c Given that the mean of the 9 readings was 22.53 mm, determine a 98% confidence 
interval for the true body length of the butterfly. (3 marks) 




E Hypothesis testing for the difference between means 


You need to be able to carry out a hypothesis test Links ) 
In your A level course you used 
for the difference between the means of two : : 


normal distributions with known variances. 


the sample mean to carry out hypothesis 
tests for the mean of a single normal 

If, instead of one population, you now have two distribution. © SM2, Section 3.7 
independent populations then you can test 

hypotheses about the difference between the population means. 


In Chapter 4 you saw tha,t if Y and Yare two independent normal distributions with means of jw, and 
ji, and standard deviations o,.and a, respectively, then 


ae aoe N(iL, ~ by, as + o%) 
Now, if Y and Y are sample means based on samples of size n,, and n, respectively from the above 
two normal populations, then 
¥-Y n( ee 
— ~ = ‘ eet + — 
Lyx Ly ny Ny 
and the statistic ¥ — Y can be used to test hypotheses about the values of ju, and Hy. 


The central limit theorem tells you that, provided the sample sizes n, and n, are large, X¥ -Y will 
have a normal distribution whatever the distributions of X and Y. You can therefore use this to test 
whether there is a significant difference between the means of any two populations. The usual null 
hypothesis is that the values of yz, and yu, are equal, but other situations are possible provided that 
the null hypothesis gives you a value for 1, — [1,. 


The test statistic you will need to use is based upon the distribution of Y — Y and is 


Yo) 2,= 
Z= po Ms by) Note } The formula for the test 
o2 0% statistic is in the formula book. 
i hy 


■ Test for difference between two means:

• If $X \sim N(\mu_x, \sigma_x^2)$ and the independent random variable $Y \sim N(\mu_y, \sigma_y^2)$, then a test of the null hypothesis $H_0: \mu_x = \mu_y$ can be carried out using the test statistic

$$Z = \frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n_x} + \dfrac{\sigma_y^2}{n_y}}}$$

• If the sample sizes $n_x$ and $n_y$ are large, then the result can be extended, by the central limit theorem, to include cases where the distributions of X and Y are not normal.
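The test statistic in the box above is simple to compute directly. The following is a minimal sketch: the function name `two_sample_z` and the figures in the usage lines are hypothetical, chosen so that the arithmetic is easy to follow by hand.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x_bar, y_bar, sigma_x, sigma_y, n_x, n_y, diff_under_h0=0.0):
    """z statistic for a difference of means with known standard deviations.

    diff_under_h0 is the value of mu_x - mu_y given by the null hypothesis.
    """
    standard_error = sqrt(sigma_x ** 2 / n_x + sigma_y ** 2 / n_y)
    return (x_bar - y_bar - diff_under_h0) / standard_error

# Hypothetical data: the standard error is sqrt(1.44/36 + 0.81/36) = 0.25.
z = two_sample_z(x_bar=10.4, y_bar=10.0, sigma_x=1.2, sigma_y=0.9, n_x=36, n_y=36)

# One-tailed 5% critical value from the standard normal distribution.
critical_value = NormalDist().inv_cdf(0.95)

print(z, critical_value, z > critical_value)
```

With these made-up figures z is 0.4 / 0.25 = 1.6, which falls just short of the one-tailed 5% critical value, so the null hypothesis would not be rejected.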
Example


The weights of boys and girls in a certain school are known to be normally distributed with 
standard deviations of 5 kg and 8 kg respectively. A random sample of 25 boys had a mean weight 
of 48 kg and a random sample of 30 girls had a mean weight of 45 kg. 


Stating your hypotheses clearly, test, at the 5% level of significance, whether there is evidence that 
the mean weight of the boys in the school is greater than the mean weight of the girls. 


$H_0: \mu_{boy} = \mu_{girl}$   $H_1: \mu_{boy} > \mu_{girl}$

(The alternative hypothesis is $\mu_{boy} > \mu_{girl}$ since this is what you are testing for. The null hypothesis is that the two population means are the same.)

$\sigma_1 = 5$, $n_1 = 25$, $\sigma_2 = 8$ and $n_2 = 30$

The value of the test statistic is

$$z = \frac{\bar{x} - \bar{y} - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n_x} + \dfrac{\sigma_y^2}{n_y}}}$$

($\mu_x - \mu_y = 0$ from the null hypothesis.)

$$z = \frac{48 - 45}{\sqrt{\dfrac{25}{25} + \dfrac{64}{30}}} = \frac{3}{\sqrt{3.1333\ldots}} = 1.6947\ldots$$

The 5% (one-tailed) critical value for Z is z = 1.6449 (table on page 214) so this value is significant and you can reject $H_0$ and conclude that there is evidence that the mean weight of the boys is greater than the mean weight of the girls.

Watch out: Always quote the critical value from the tables in full and give your conclusion in context.
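The arithmetic in this example can be checked numerically. This is a quick sketch using Python's standard library; the variable names are illustrative, and the critical value is taken from the inverse of the standard normal distribution function rather than from the printed table.

```python
from math import sqrt
from statistics import NormalDist

# Known standard deviations and sample sizes for boys (1) and girls (2).
sigma_1, n_1 = 5, 25
sigma_2, n_2 = 8, 30

standard_error = sqrt(sigma_1 ** 2 / n_1 + sigma_2 ** 2 / n_2)

# Under H0 the difference between the population means is 0.
z = (48 - 45 - 0) / standard_error

# One-tailed test at the 5% level: reject H0 if z exceeds this value.
critical_value = NormalDist().inv_cdf(0.95)
reject_h0 = z > critical_value

print(round(z, 3), round(critical_value, 4), reject_h0)
```

The computed statistic is only slightly above the critical value, which is why quoting the critical value in full matters here.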


Sometimes you may be asked to test, for example, whether the mean weight of the boys exceeds 
the mean weight of the girls by more than 2 kg. The test would be similar to the above but the 
hypotheses would be slightly different and this will affect the test statistic. 


Example 


The weights of boys and girls in a certain school are known to be normally distributed with 
standard deviations of 5 kg and 8 kg respectively. A random sample of 25 boys had a mean weight 
of 48 kg and a random sample of 30 girls had a mean weight of 45 kg. 


Stating your hypotheses clearly, test, at the 5% level of significance, whether there is evidence that the 
mean weight of the boys in the school is more than 2 kg greater than the mean weight of the girls. 
$H_0: \mu_{boy} - \mu_{girl} = 2$   $H_1: \mu_{boy} - \mu_{girl} > 2$

$\sigma_1 = 5$, $n_1 = 25$, $\sigma_2 = 8$ and $n_2 = 30$

The value of the test statistic is

$$z = \frac{\bar{x} - \bar{y} - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n_x} + \dfrac{\sigma_y^2}{n_y}}} = \frac{48 - 45 - 2}{\sqrt{\dfrac{25}{25} + \dfrac{64}{30}}} = 0.565$$

The 5% (one-tailed) critical value for Z is z = 1.6449 (table on page 214) so this value is not significant.

There is insufficient evidence that the mean weight of the boys is more than 2 kg greater than the mean weight of the girls.
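Only the numerator changes when the null hypothesis specifies a non-zero difference. The sketch below repeats the calculation with the hypothesised difference of 2 kg and also reports a one-tailed p-value, which is an addition for illustration and is not part of the worked solution.

```python
from math import sqrt
from statistics import NormalDist

standard_error = sqrt(5 ** 2 / 25 + 8 ** 2 / 30)

# H0: mu_boy - mu_girl = 2, so 2 is subtracted from the observed difference.
hypothesised_difference = 2
z = (48 - 45 - hypothesised_difference) / standard_error

# Probability of a z value at least this large if H0 is true.
p_value = 1 - NormalDist().cdf(z)

print(round(z, 3), round(p_value, 3), p_value < 0.05)
```

A p-value well above 0.05 agrees with the conclusion reached by comparing z with the critical value.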


A manufacturer of personal stereos can use batteries made by two different manufacturers. 

The standard deviation of lifetimes for Never Die batteries is 3.1 hours and for Everlasting batteries 
is 2.9 hours. A random sample of 80 Never Die batteries and a random sample of 90 Everlasting 
batteries were tested and their mean lifetimes were 7.9 hours and 8.2 hours respectively. 


Stating your hypotheses clearly, test, at the 5% level of significance, whether there is evidence of a 
difference between the mean lifetimes of the two makes of batteries. 


Let $\mu_x$ be the mean lifetime of Never Die batteries and let $\mu_y$ be the mean lifetime of Everlasting batteries.

$H_0: \mu_x = \mu_y$   $H_1: \mu_x \neq \mu_y$

(You are testing for a difference (in either direction) between the means, so use a two-tailed test.)

$\sigma_x = 3.1$, $n_x = 80$, $\sigma_y = 2.9$ and $n_y = 90$

$\bar{x} - \bar{y} = 7.9 - 8.2 = -0.3$

$$z = \frac{\bar{x} - \bar{y} - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n_x} + \dfrac{\sigma_y^2}{n_y}}} = \frac{-0.3}{\sqrt{\dfrac{3.1^2}{80} + \dfrac{2.9^2}{90}}} = -0.649\ldots$$

The 5% (two-tailed) critical values for Z are z = ±1.9600.

So this value is not significant and you do not reject $H_0$. You can conclude that there is no significant evidence of a difference in the mean lifetimes of the two makes of batteries.
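For a two-tailed test the significance level is split between the two tails. The following sketch checks the battery calculation; the variable names are illustrative and the p-value is an extra shown for comparison.

```python
from math import sqrt
from statistics import NormalDist

standard_error = sqrt(3.1 ** 2 / 80 + 2.9 ** 2 / 90)
z = (7.9 - 8.2 - 0) / standard_error

# Two-tailed test at the 5% level: 2.5% in each tail.
critical_value = NormalDist().inv_cdf(0.975)
reject_h0 = abs(z) > critical_value

# Two-tailed p-value: probability of a value at least this extreme in either direction.
p_value = 2 * NormalDist().cdf(-abs(z))

print(round(z, 3), round(critical_value, 4), reject_h0, round(p_value, 3))
```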


Exercise 




In Questions 1-3 carry out a test on the given hypotheses at the given level of significance. 
The populations from which the random samples are drawn are normally distributed. 
1 $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 > \mu_2$, $n_1 = 15$, $\sigma_1 = 5.0$, $n_2 = 20$, $\sigma_2 = 4.8$, $\bar{x}_1 = 23.8$ and $\bar{x}_2 = 21.5$ using a 5% level.


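2 $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \neq \mu_2$, $n_1 = 30$, $\sigma_1 = 4.2$, $n_2 = 25$, $\sigma_2 = 3.6$, $\bar{x}_1 = 49.6$ and $\bar{x}_2 = 51.7$ using a 5% level.

3 $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 < \mu_2$, $n_1 = 25$, $\sigma_1 = 0.81$, $n_2 = 36$, $\sigma_2 = 0.75$, $\bar{x}_1 = 3.62$ and $\bar{x}_2 = 4.11$ using a 1% level.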


In Questions 4-6 carry out a test on the given hypotheses at the given level of significance. 
Given that the distributions of the populations are unknown, explain the significance of the central 
limit theorem in these tests. 


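4 $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \neq \mu_2$, $n_1 = 85$, $\sigma_1 = 8.2$, $n_2 = 100$, $\sigma_2 = 11.3$, $\bar{x}_1 = 112.0$ and $\bar{x}_2 = 108.1$ using a 1% level.

5 $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 > \mu_2$, $n_1 = 100$, $\sigma_1 = 18.3$, $n_2 = 150$, $\sigma_2 = 15.4$, $\bar{x}_1 = 72.6$ and $\bar{x}_2 = 69.5$ using a 5% level.

6 $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 < \mu_2$, $n_1 = 120$, $\sigma_1 = 0.013$, $n_2 = 90$, $\sigma_2 = 0.015$, $\bar{x}_1 = 0.863$ and $\bar{x}_2 = 0.868$ using a 1% level.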


(E/P) 7 A factory has two machines designed to cut piping. The first machine works to a standard deviation of 0.011 cm and the second machine has a standard deviation of 0.015 cm. A random sample of 10 pieces of piping from the first machine has a mean length of 6.531 cm and a random sample of 15 pieces from the second machine has a mean length of 6.524 cm. Assuming that the lengths of piping follow a normal distribution, test, at the 5% level, whether the machines are producing piping of the same mean length. (7 marks)


(E/P) 8 A farmer grows wheat. He wants to improve his yield per acre by at least 1 tonne by buying a different variety of seed. The variance of the yield of the old seed is 0.6 tonnes² and the variance of the yield of the new seed is 0.8 tonnes². A random sample of 70 acres of wheat planted with the old seed has a mean yield of 5 tonnes and a random sample of 80 acres of wheat planted with the new seed has mean yield of 6.5 tonnes.

a Test, at the 5% level of significance, whether there is evidence that the mean yield of the new seed is more than 1 tonne greater than the mean yield of the old seed. State your hypotheses clearly. (9 marks)


b Explain the relevance of the central limit theorem to the test in part a. (2 marks) 


An agricultural research scientist investigated whether the diet of cows influenced the amount of fat in their milk. The fat content of 60 litres of milk selected at random from cows fed entirely on grain had a mean value of 4.1 g per litre and a random sample of 50 litres of milk from cows fed on a combination of grain and grass had a mean value of 3.7 g per litre.

It is known that the variance of fat content for a grain-only diet is 0.8 (g/l)² and that the variance of fat content for a grain and grass diet is 0.75 (g/l)².

a Stating your hypotheses clearly and using a 5% level of significance, test whether there is a difference between the mean fat content of milk from cows fed on these two diets. (7 marks)

b State, in the context of this question, an assumption you have made in carrying out the test in part a. (1 mark)


Challenge 


Two independent random variables, X and Y, have unknown means $\mu_x$ and $\mu_y$ respectively. The variances, $\sigma_x^2$ and $\sigma_y^2$, are both known. Random samples of $n_x$ observations from the random variable X and $n_y$ observations from the random variable Y are taken. The sample means are $\bar{x}$ and $\bar{y}$.

A hypothesis test is carried out to see if there is a difference in the means of the two samples and it is found that the null hypothesis is accepted.

A confidence interval is to be found for the common mean of the two samples, $\mu$. $\hat{\mu}$ is used to denote the pooled estimate of the population mean, and is found by finding the mean of the combined samples.

a Show that $\hat{\mu} = \dfrac{n_x \bar{x} + n_y \bar{y}}{n_x + n_y}$


The distribution of the corresponding random variable is given as $\hat{\mu} \sim N\left(\mu,\ \dfrac{n_x \sigma_x^2 + n_y \sigma_y^2}{(n_x + n_y)^2}\right)$


Given that $n_x = 100$, $n_y = 120$, $\bar{x} = 46.0$, $\bar{y} = 47.0$, $\sigma_x^2 = 16.0$ and $\sigma_y^2 = 24.0$,
b find the 99% confidence interval for $\mu$.
Use of large sample results for an unknown population


One of the practical difficulties that you will encounter when carrying out hypothesis tests based on samples, and also in finding confidence intervals, is the need to know the value of the population standard deviation. In practice, if you do not know $\mu$, it is unlikely that you will know $\sigma$. Sometimes it is reasonable to assume that similar processes or populations will have the same standard deviation but perhaps just have an altered mean. Occasionally you can look at historical data and see that over a period of time the standard deviation has been constant, and it may be reasonable to assume that it remains so, but often it may be impossible to choose a reliable value of $\sigma$.

In situations such as this, you can use the central limit theorem and the sample variance, $s^2$, which is an unbiased estimator of the population variance.


■ If the population is normal, or can be assumed to be so, then, for large samples, $\dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ has an approximate $N(0, 1^2)$ distribution.

■ If the population is not normal, by assuming that s is a close approximation to $\sigma$, then for large samples, $\dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ can be treated as having an approximate $N(0, 1^2)$ distribution.

Watch out: Both of these tests rely on large sample sizes, and the second test also relies on the central limit theorem.
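The large-sample statistic can be computed in the same way as before, with s standing in for the unknown σ. This is a minimal sketch: the function name `large_sample_z` and the sample figures are hypothetical and are not taken from the text.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_z(x_bar, s, n, mu_0):
    """Approximate z statistic for a large sample, using s in place of sigma."""
    return (x_bar - mu_0) / (s / sqrt(n))

# Hypothetical sample: n = 100, sample mean 50.8, sample standard deviation 4.
# Testing H0: mu = 50, the standard error is 4 / sqrt(100) = 0.4.
z = large_sample_z(x_bar=50.8, s=4.0, n=100, mu_0=50.0)

# Two-tailed p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 3), round(p_value, 4))
```

The approximation is only as good as the assumption that s is close to σ, so it should not be used for small samples.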


As part of a study of the health of young schoolchildren, a random sample of 220 children from 
area A and a second, independent random sample of 180 children from area B were weighed. 
The results are given in the table below. 


           n     $\bar{x}$     s
Area A    220    37.8    3.6
Area B    180    38.6    4.1


a Test, at the 5% level of significance, whether there is evidence of a difference in the mean weight 
of children in the two areas. State your hypotheses clearly. 


b State an assumption you have made in carrying out this test. 
c Explain the significance of the central limit theorem to this test.

a $H_0: \mu_A = \mu_B$   $H_1: \mu_A \neq \mu_B$

Test statistic

$$z = \frac{\bar{x}_A - \bar{x}_B - (\mu_A - \mu_B)}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}}$$

Using s in place of $\sigma$, and taking the difference as $\bar{x}_B - \bar{x}_A$ so that z is positive (the sign does not affect a two-tailed test):

$$z = \frac{38.6 - 37.8}{\sqrt{\dfrac{3.6^2}{220} + \dfrac{4.1^2}{180}}} = 2.0499\ldots = 2.05 \text{ (3 s.f.)}$$

Two-tail 5% critical values are z = ±1.96

Since 2.05 > 1.96, the result is significant so reject $H_0$.

There is evidence that the mean weight of children in the two areas is different.

b The test statistic requires $\sigma$ so you have to assume that $s^2 = \sigma^2$ for both samples.

c You are not told that the populations are normally distributed but the samples are both large and so the central limit theorem enables us to assume that $\bar{X}_A$ and $\bar{X}_B$ are both normal.

Watch out: Note that the central limit theorem is not an assumption; it is a theorem that can be invoked to enable you to use the normal distribution. The assumption that $s^2 = \sigma^2$ is reasonable since both samples are large.
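The calculation for the two areas can be reproduced from the summary table. The sketch below stores the table in a dictionary (an arbitrary choice of layout made for this illustration) and uses each sample standard deviation in place of the unknown population value.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from the table: sample size, sample mean, sample standard deviation.
areas = {
    "A": {"n": 220, "mean": 37.8, "s": 3.6},
    "B": {"n": 180, "mean": 38.6, "s": 4.1},
}

# Estimated variance of the difference between the two sample means.
variance_of_difference = sum(a["s"] ** 2 / a["n"] for a in areas.values())

# Difference taken as B minus A so that z is positive, as in the solution.
z = (areas["B"]["mean"] - areas["A"]["mean"]) / sqrt(variance_of_difference)

critical_value = NormalDist().inv_cdf(0.975)
significant = abs(z) > critical_value

print(round(z, 2), round(critical_value, 2), significant)
```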


Exercise 5D


(E/P) 1 An experiment was conducted to compare the drying properties of two paints, Quickdry and 
Speedicover. In the experiment, 200 similar pieces of metal were painted, 100 randomly allocated 
to Quickdry and the rest to Speedicover. 


The table below summarises the times, in minutes, taken for these pieces of metal to become touch-dry.

                      Quickdry   Speedicover
Mean                    28.7        30.6
Standard deviation      7.32        3.51


Using a 5% significance level, test whether the mean time for Quickdry to become touch-dry is 
less than that for Speedicover. State your hypotheses clearly. (6 marks) 


(E/P) 2 A supermarket examined a random sample of 80 weekend shoppers’ purchases and an 
independent random sample of 120 weekday shoppers’ purchases. The results are summarised in 
the table below. 


            n     $\bar{x}$      s
Weekend     80    38.64   6.59
Weekday    120    40.13   8.23

a Stating your hypotheses clearly, test, at the 5% level of significance, whether there is evidence that the mean expenditure in the week is more than at weekends. (6 marks)


b State an assumption you have made in carrying out this test. (1 mark) 


It is claimed that the components produced in a small factory have a mean mass of 10 g. A random sample of 250 of these components is tested and the sample mean, $\bar{x}$, is 9.88 g and the standard deviation, s, is 1.12 g.


a Test, at the 5% level, whether there has been a change in the mean mass of a component. 
(6 marks) 


b State any assumptions you would make to carry out this test. (1 mark) 


Two independent samples are taken from population A and population B. Carry out the following tests using the information given.

a $H_0: \mu_A = \mu_B$, $H_1: \mu_A < \mu_B$ using a 1% level of significance
$n_A = 90$, $n_B = 110$, $\bar{x}_A = 84.1$, $\bar{x}_B = 87.9$, $s_A = 12.5$, $s_B = 14.6$ (5 marks)

b $H_0: \mu_A - \mu_B = 2$, $H_1: \mu_A - \mu_B > 2$ using a 5% level of significance
$n_A = 150$, $n_B = 200$, $\bar{x}_A = 125.1$, $\bar{x}_B = 119.3$, $s_A = 23.2$, $s_B = 18.4$ (5 marks)

c State an assumption that you have made in carrying out these tests. (1 mark)


A shopkeeper complains that the average mass of chocolate bars of a certain type that he is buying from a wholesaler is less than the stated value of 85.0 g. The shopkeeper measured the mass of 100 bars from a large delivery and found that their masses had a mean of 83.6 g and a standard deviation of 7.2 g. Using a 5% significance level, determine whether the shopkeeper is justified in his complaint. State clearly the null and alternative hypotheses that you are using, and express your conclusion in words. (6 marks)


A health authority set up an investigation to examine the ages of mothers when they give birth to their first child.
A random sample of 250 first-time mothers from a certain year had a mean age of 22.45 years with a standard deviation of 2.9 years. A further random sample of 280 first-time mothers taken 10 years later had a mean age of 22.96 years with a standard deviation of 2.8 years.
a Test whether these figures suggest that there is a difference in the mean age of first-time mothers between these two dates. Use a 5% level of significance. (6 marks)
b State any assumptions you have made about the distribution of ages of first-time mothers. (1 mark)

Mixed exercise 5

The masses of bags of lentils, X kg, have a normal distribution with unknown mean $\mu$ kg and a known standard deviation $\sigma$ kg. A random sample of 80 bags of lentils gave a 90% confidence interval for $\mu$ of (0.4533, 0.5227).

a Without carrying out any further calculations, use this confidence interval to test whether $\mu = 0.48$. State your hypotheses clearly and write down the significance level you have used. (3 marks)

A second random sample of 120 of these bags of lentils had a mean mass of 0.482 kg.
b Calculate a 95% confidence interval for $\mu$ based on this second sample. (6 marks)


The lengths of the tails of mice in a pet shop are assumed to have unknown mean $\mu$ and unknown standard deviation $\sigma$.

A random sample of 20 mice is taken and the length of their tails recorded.
The sample is represented by $X_1, X_2, \ldots, X_{20}$.
a State whether or not the following are statistics. 


Give reasons for your answers. 


20 
i — a ii >> - 2) iii > (4 marks) 
1 
; a 4X, = Xn 
b Find the mean and variance of gs (3 marks) 


The breaking stresses of rubber bands are normally distributed. 
A company uses bands with a mean breaking stress of 46.50 N. 
A new supplier claims that they can supply bands that are stronger and provides a sample of 
100 bands for the company to test. The company checked the breaking stress, X, for each of 
these 100 bands and the results are summarised as follows: 
$n = 100$   $\sum x = 4715$   $\sum x^2 = 222\,910$

a Test, at the 5% level, whether there is evidence that the new bands are stronger. (6 marks) 
b Find an approximate 95% confidence interval for the mean breaking stress of these 

new rubber bands. (3 marks) 


On each of 100 days, a conservationist took a sample of 1 litre of water from a particular place along a river, and measured the amount, X mg, of chlorine in the sample. The results she obtained are shown in the table.


X 1 2 3 4 5 6 7 8 9 
Number of days | 4 8 | 20 | 22 | 16 | 13 | 10 |] 6 


a Estimate the mean amount of chlorine present per litre of water, and estimate, to 3 decimal places, the standard error of this estimate. (3 marks)

b Obtain approximate 98% confidence limits for the mean amount of chlorine present per litre of water. (3 marks)

Given that measurements at the same point under the same conditions are taken for a further 100 days,
c estimate, to 3 decimal places, the probability that the mean of these measurements will be greater than 4.6 mg per litre of water. (3 marks)


The amount, to the nearest mg, of a certain chemical in particles in the atmosphere at a meteorological station was measured each day for 300 days. The results are shown in the table.

Amount of chemical (mg)    12    13    14    15    16
Number of days              5    42   210    31    12

Estimate the mean amount of this chemical in the atmosphere, and find, to 2 decimal places, the standard error of this estimate. (3 marks)


From time to time a firm manufacturing pre-packed furniture needs to check the mean distance 
between pairs of holes drilled by a machine in pieces of chipboard to ensure that no change has 
occurred. It is known from experience that the standard deviation of the distance is 0.43 mm. 
The firm intends to take a random sample of size n, and to calculate a 99% confidence interval 
for the mean of the population. The width of this interval must be no more than 0.60 mm. 


Calculate the minimum value of n. (4 marks) 


The times taken by five-year-old children to complete a certain task are normally distributed with a standard deviation of 8.0 s. A random sample of 25 five-year-old children from school A were given this task and their mean time was 44.2 s.
a Find 95% confidence limits for the mean time taken by five-year-old children from school A to complete this task. (3 marks)
The mean time for a random sample of 20 five-year-old children from school B was 40.9 s. The headteacher of school B concluded that the overall mean for school B must be less than that of school A. Given that the two samples were independent,
b test the headteacher’s conclusion using a 5% significance level. State your hypotheses clearly. (6 marks)


The random variable X is normally distributed with mean $\mu$ and variance $\sigma^2$.
a Write down the distribution of the sample mean $\bar{X}$ of a random sample of size n. (1 mark)
b State, with a reason, whether this distribution is exact or is an estimate. (1 mark)
An efficiency expert wishes to determine the mean time taken to drill a fixed number of holes in 
a metal sheet. 
c Determine how large a random sample is needed so that the expert can be 95% certain 
that the sample mean time will differ from the true mean time by less than 15 seconds. 
Assume that it is known from previous studies that o = 40 seconds. (4 marks) 


A commuter regularly uses a train service which should arrive in London at 09:31. He decided to test this stated arrival time. Each working day for a period of 4 weeks, he recorded the number of minutes X that the train was late on arrival in London. If the train arrived early then the value of X was negative. His results are summarised as follows:

$n = 20$   $\sum x = 15.0$   $\sum x^2 = 103.21$

a Calculate unbiased estimates of the mean and variance of the number of minutes late of this train service. (3 marks)


The random variable X represents the number of minutes that the train is late on arriving 
in London. Records kept by the railway company show that over fairly short periods, the 
standard deviation of X is 2.5 minutes. The commuter made two assumptions about the 
distribution of X and the values obtained in the sample and went on to calculate a 95% 
confidence interval for the mean arrival time of this train service. 


b State the two assumptions. (2 marks) 
c Find the confidence interval. (3 marks) 
d Given that the assumptions are reasonable, comment on the stated arrival time of the service. (1 mark)

The random variable X is normally distributed with mean $\mu$ and variance $\sigma^2$.
a Write down the distribution of the sample mean $\bar{X}$ of a random sample of size n. (1 mark)
b Explain what you understand by a 95% confidence interval. (2 marks)


A garage sells both leaded and unleaded petrol. The distribution of the values of sales for 
each type is normal. During 2010 the standard deviation of individual sales of each type of 
petrol was £3.25. The mean of the individual sales of leaded petrol during this time was £8.72. 
A random sample of 100 individual sales of unleaded petrol gave a mean of £9.71. 


Calculate: 
c an interval within which 90% of the sales of leaded petrol will lie, (3 marks) 
d a 95% confidence interval for the mean sales of unleaded petrol. (3 marks) 


The mean of the sales of unleaded petrol for 2009 was £9.10. 


e Using a 5% significance level, investigate whether there is sufficient evidence to conclude 
that the mean of all the 2010 unleaded sales was greater than the mean of the 2009 sales. 
(6 marks) 


f Find the size of the sample that should be taken so that the garage proprietor can be 
95% certain that the sample mean of sales of unleaded petrol during 2010 will differ 
from the true mean by less than 50p. (4 marks) 


a Explain what is meant by a 98% confidence interval for a population mean. (2 marks) 


The lengths, in cm, of the leaves of willow trees are known to be normally distributed with variance 1.33 cm².


A sample of 40 willow tree leaves is found to have a mean of 10.20cm. 
b Estimate, giving your answer to 3 decimal places, the standard error of the mean. (2 marks) 


c Use this value to estimate symmetrical 95% confidence limits for the mean length of 
the population of willow tree leaves, giving your answer to 2 decimal places. (3 marks) 


d Find the minimum size of the sample of leaves which must be taken if the width of the symmetrical 98% confidence interval for the population mean is at most 1.50 cm. (4 marks)

(E/P) 12 a Write down the mean and the variance of the distribution of the means of all possible samples of size n taken from an infinite population having mean $\mu$ and variance $\sigma^2$. (2 marks)
b Describe the form of this distribution of sample means when
i n is large
ii the distribution of the population is normal. (2 marks)


The standard deviation of all the till receipts of a supermarket during 2014 was £4.25. 


c Given that the mean of a random sample of 100 of the till receipts is £18.50, obtain an approximate 95% confidence interval for the mean of all the till receipts during 2014. (3 marks)
d Find the size of sample that should be taken so that the management can be 95% confident that the sample mean will not differ from the true mean by more than 50p. (3 marks)


e The mean of all the till receipts of the supermarket during 2013 was £19.40. Using a 5% 
significance level, investigate whether the sample in part a provides sufficient evidence to 
conclude that the mean of all the 2014 till receipts is different from that in 2013. (6 marks) 


(E/P) 13 Records of the diameters of spherical ball bearings produced on a certain machine indicate that 
the diameters are normally distributed with mean 0.824cm and standard deviation 0.046 cm. 
Two hundred samples are chosen, each consisting of 100 ball bearings. 
a Calculate the expected number of the 200 samples having a mean diameter less than 0.823 cm. (2 marks)

On a certain day it was suspected that the machine was malfunctioning. It may be assumed that if the machine is malfunctioning it will change the mean of the diameters without changing their standard deviation. On that day a random sample of 100 ball bearings had mean diameter 0.834 cm.
b Determine a 98% confidence interval for the mean diameter of the ball bearings being 
produced that day. (3 marks) 


c Hence state whether or not you would conclude that the machine is malfunctioning on that 
day given that the significance level is 2%. (3 marks) 


(E/P) 14 A cardiologist claims that there is a higher mean heart rate in people who always drive to 
work compared to people who regularly walk to work. She measures the heart rates, X, of 30 
people who always drive to work and 36 people who regularly walk to work. Her results are 
summarised in the table below. 

                  n     $\bar{x}$     $s^2$
Drive to work    30     52     60.2
Walk to work     36     47     55.8


a Test, at the 5% level of significance, the cardiologist’s claim. State your hypotheses clearly. 
(6 marks) 


b State any assumptions you have made in testing the cardiologist’s claim. (2 marks) 


The cardiologist decides to add another person who drives to work to her data. She measures 
the person’s heart rate and finds X = 55. 


c Find an unbiased estimate of the variance for the sample of 31 people who drive to work. 
Give your answer to 3 significant figures. (4 marks) 


Challenge 


Two independent random samples $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ are taken from a population with mean $\mu$ and variance $\sigma^2$. The unbiased estimators $\bar{X}$ and $\bar{Y}$ of $\mu$ are calculated. A new unbiased estimator T of $\mu$ is sought of the form $T = r\bar{X} + s\bar{Y}$.

a Show that, since T is unbiased, $r + s = 1$.
b By writing $T = r\bar{X} + (1 - r)\bar{Y}$, show that

$$\text{Var}(T) = \sigma^2\left(\frac{r^2}{n} + \frac{(1 - r)^2}{m}\right)$$

c Show that the minimum variance of T is when $r = \dfrac{n}{n + m}$.

d Find the best (in the sense of minimum variance) unbiased estimator of $\mu$ of the form $r\bar{X} + s\bar{Y}$.


Summary of key points 


1 If X is a random variable, then a random sample of size n will consist of n observations of the random variable X, which are referred to as $X_1, X_2, X_3, \ldots, X_n$, where the $X_i$

• are independent random variables
• each have the same distribution as X.


A statistic, T, is defined as a random variable consisting of any function of the X; that 
involves no other quantities, such as unknown population parameters. 


2 The sampling distribution of a statistic T is the probability distribution of T.

3 A statistic that is used to estimate a population parameter is called an estimator and the particular value of the estimator generated from the sample taken is called an estimate.

4 If a statistic T is used as an estimator for a population parameter $\theta$ and $E(T) = \theta$ then T is an unbiased estimator for $\theta$.

5 If a statistic T is used as an estimator for a population parameter $\theta$ then the bias $= E(T) - \theta$. For an unbiased estimator, the bias is 0.

6 An unbiased estimator for $\sigma^2$ is given by the sample variance $S^2$ where

$$S^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$


7 The standard deviation of an estimator is called the standard error of the estimator. 


8 When you are using the sample mean, $\bar{X}$, you can use the following result for the standard error:

Standard error of $\bar{X} = \dfrac{\sigma}{\sqrt{n}}$ or $\dfrac{s}{\sqrt{n}}$

9 A confidence interval for a population parameter $\theta$ is a range of values defined so that there is a specific probability that the true value of the parameter lies within that range.

10 A 95% confidence interval for the population mean $\mu$ is $\left(\bar{x} - 1.96 \times \dfrac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96 \times \dfrac{\sigma}{\sqrt{n}}\right)$

11 The width of a confidence interval is the difference between the upper confidence limit and the lower confidence limit. This is $2 \times z \times \dfrac{\sigma}{\sqrt{n}}$, where z is the relevant percentage point from the standard normal distribution, for example 1.96, 1.6449, etc.


12 Test for difference between two means:

• If $X \sim N(\mu_x, \sigma_x^2)$ and the independent random variable $Y \sim N(\mu_y, \sigma_y^2)$, then a test of the null hypothesis $H_0: \mu_x = \mu_y$ can be carried out using the test statistic

$$Z = \frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n_x} + \dfrac{\sigma_y^2}{n_y}}}$$

• If the sample sizes $n_x$ and $n_y$ are large then the result can be extended, by the central limit theorem, to include cases where the distributions of X and Y are not normal.


13 • If the population is normal, or can be assumed to be so, then, for large samples, $\dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ has an approximate $N(0, 1^2)$ distribution.

• If the population is not normal, by assuming that s is a close approximation to $\sigma$, then $\dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ can be treated as having an approximate $N(0, 1^2)$ distribution.
Further hypothesis tests


After completing this chapter, you should be able to:

● Find a confidence interval for the variance of a normal distribution
● Conduct a hypothesis test for the variance of a normal distribution
● Understand and use the F-distribution
● Carry out an F-test to test whether two independent random variables are from normal distributions with equal variances



Prior knowledge check

1 A sample of size 5 was taken from a population with unknown mean and variance.

21   25   27   28   31

Find an unbiased estimate for:
a the population mean
b the population variance. ← Chapter 5

2 The random variable X has a chi-squared distribution with 7 degrees of freedom.
Find x such that P(X > x) = 0.975 ← FS1, Chapter 6

3 A random sample of size 30 is taken from a normal population with standard deviation of 4.2. The mean of the sample was 17.5.
Find a 95% confidence interval for the population mean $\mu$. ← Chapter 5

Customers in supermarkets like to buy fresh produce of a uniform size. You could use samples of fruit from different suppliers to test which supplier was likely to provide more consistently sized produce. → Section 6.3


Chapter 6 


Variance of a normal distribution


If you take a sample of n independent observations $X_1, X_2, \ldots, X_n$, with sample mean $\bar{X}$, then

$$S^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$

is an unbiased estimator of the population variance $\sigma^2$.
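The estimator above divides by n − 1 rather than n. The sketch below computes it by hand for a small made-up sample (the data are hypothetical) and compares the result with `statistics.variance` from Python's standard library, which uses the same n − 1 divisor.

```python
import statistics

def unbiased_variance(observations):
    """Sample variance with divisor n - 1."""
    n = len(observations)
    mean = sum(observations) / n
    return sum((x - mean) ** 2 for x in observations) / (n - 1)

# Hypothetical sample: the mean is 4.4 and the squared deviations sum to 13.2.
sample = [2, 4, 4, 5, 7]

s_squared = unbiased_variance(sample)
library_value = statistics.variance(sample)

# Dividing by n instead of n - 1 gives a smaller, biased value.
biased_value = statistics.pvariance(sample)

print(s_squared, library_value, biased_value)
```

Here the unbiased estimate is 13.2 / 4 = 3.3, while dividing by n would give 13.2 / 5 = 2.64.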


Different samples from the same population will give different estimates for $\sigma^2$, so in the same way that $\bar{x}$ is a particular value of the random variable $\bar{X}$, $s^2$ is a particular value of a random variable $S^2$.

In order to find a confidence interval for $\sigma^2$ you need to know something about the distribution of $S^2$.

Watch out: To find confidence intervals for the population mean, you used the central limit theorem to assume that the sample mean was approximately normally distributed. You cannot apply the central limit theorem to $S^2$.

The distribution of $S^2$ is not easy to find, but if X is normally distributed, then the distribution of $\dfrac{(n - 1)S^2}{\sigma^2}$ is a chi-squared distribution with n − 1 degrees of freedom.

Links: You have previously used the chi-squared family of distributions to model goodness of fit. ← FS1, Chapter 6


■ If a random sample of n observations $X_1, X_2, \ldots, X_n$ is selected from $N(\mu, \sigma^2)$, then

$$\frac{(n - 1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$

Percentage points for the chi-squared distribution are given in the table in the formulae booklet, and on page 215. The number down the left-hand side is the number of degrees of freedom, which should be equal to n − 1.

ν    0.995   0.990   0.975   0.950   0.900    0.100    0.050    0.025    0.010    0.005
1    0.000   0.000   0.001   0.004   0.016    2.705    3.841    5.024    6.635    7.879
2    0.010   0.020   0.051   0.103   0.211    4.605    5.991    7.378    9.210   10.597
3    0.072   0.115   0.216   0.352   0.584    6.251    7.815    9.348   11.345   12.838
4    0.207   0.297   0.484   0.711   1.064    7.779    9.488   11.143   13.277   14.860
5    0.412   0.554   0.831   1.145   1.610    9.236   11.070   12.832   15.086   16.750
6    0.676   0.872   1.237   1.635   2.204   10.645   12.592   14.449   16.812   18.548
7    0.989   1.239   1.690   2.167   2.833   12.017   14.067   16.013   18.475   20.278


Because the $\chi^2$ distribution is non-symmetric, both tails of the distribution are given in the table.


You can use the percentage points of the chi-squared distribution to find confidence
intervals for the variance of a normally distributed population.

Note: You can also find the percentage points for the chi-squared distribution on some graphical
calculators.
If a random sample $X_1, X_2, \ldots, X_n$ is taken from an $N(\mu, \sigma^2)$ distribution, then $\frac{(n-1)S^2}{\sigma^2}$ has a $\chi^2_{n-1}$
distribution. Using the table of percentage points of the $\chi^2_{n-1}$ distribution, you can find two values
between which $\frac{(n-1)S^2}{\sigma^2}$ lies with 95% probability, and then rearrange to obtain an interval that
contains $\sigma^2$ with 95% probability.


Thus, assuming that the two tails are of equal size,

$$\chi^2_{n-1}(0.975) < \frac{(n-1)S^2}{\sigma^2} < \chi^2_{n-1}(0.025)$$

We want to get bounds for $\sigma^2$ so we isolate it by turning the fractions upside down and
reversing the inequality:

$$\frac{1}{\chi^2_{n-1}(0.025)} < \frac{\sigma^2}{(n-1)S^2} < \frac{1}{\chi^2_{n-1}(0.975)}$$

Multiplying by $(n-1)S^2$:

$$\frac{(n-1)S^2}{\chi^2_{n-1}(0.025)} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{n-1}(0.975)}$$

If you have a specific estimate $s^2$, this becomes

$$\frac{(n-1)s^2}{\chi^2_{n-1}(0.025)} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1}(0.975)}$$

The values $\frac{(n-1)s^2}{\chi^2_{n-1}(0.025)}$ and $\frac{(n-1)s^2}{\chi^2_{n-1}(0.975)}$ are the lower and upper 95% confidence limits respectively.

In a similar way, the 90% confidence limits are $\frac{(n-1)s^2}{\chi^2_{n-1}(0.05)}$ and $\frac{(n-1)s^2}{\chi^2_{n-1}(0.95)}$.


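■ Generally, for a probability of $\alpha$ that the variance falls outside the limits,
• the $100(1 - \alpha)\%$ confidence limits are $\frac{(n-1)s^2}{\chi^2_{n-1}(\alpha/2)}$ and $\frac{(n-1)s^2}{\chi^2_{n-1}(1 - \alpha/2)}$
• the $100(1 - \alpha)\%$ confidence interval for the variance of a normal distribution is

$$\left(\frac{(n-1)s^2}{\chi^2_{n-1}(\alpha/2)},\ \frac{(n-1)s^2}{\chi^2_{n-1}(1 - \alpha/2)}\right)$$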
Example 1

In order to determine the accuracy of a new rifle, 8 marksmen were selected at random to fire the
rifle at a target. The distances $x$, in mm, of the 8 shots from the centre of the target were as follows:

10, 14, 12, 8, 6, 11, 18, 14

Assuming that the distances are normally distributed, find a 95% confidence interval for the variance.

$\bar{x} = 11.625 \qquad s^2 = 14.2679$
$\chi^2_7(0.975) = 1.690 \qquad \chi^2_7(0.025) = 16.013$

The confidence limits are

$$\frac{(n-1)s^2}{\chi^2_{n-1}(0.025)} = \frac{7 \times 14.2679}{16.013} = 6.237$$

$$\frac{(n-1)s^2}{\chi^2_{n-1}(0.975)} = \frac{7 \times 14.2679}{1.690} = 59.098$$

The 95% confidence interval for the variance
is (6.237, 59.098).
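
The calculation in this example can be checked with a short script. This is only a sketch: the function name `variance_confidence_interval` is an illustrative choice, and the two percentage points are typed in by hand from the chi-squared table rather than computed.

```python
from statistics import mean, variance


def variance_confidence_interval(data, chi2_small_tail, chi2_large_tail):
    """Confidence limits for the variance of a normal population.

    chi2_small_tail: table value exceeded with probability alpha/2,
                     for example chi-squared_{n-1}(0.025).
    chi2_large_tail: table value exceeded with probability 1 - alpha/2,
                     for example chi-squared_{n-1}(0.975).
    Both values are assumed to be read from the table by the user.
    """
    n = len(data)
    s2 = variance(data)  # unbiased estimate, divisor n - 1
    lower = (n - 1) * s2 / chi2_small_tail
    upper = (n - 1) * s2 / chi2_large_tail
    return lower, upper


# Distances of the 8 shots from the centre of the target, in mm
distances = [10, 14, 12, 8, 6, 11, 18, 14]

# Percentage points for 7 degrees of freedom, taken from the table
lower, upper = variance_confidence_interval(distances, 16.013, 1.690)

print(mean(distances), round(variance(distances), 4))
print(round(lower, 3), round(upper, 3))
```

Dividing by the larger percentage point gives the lower limit, which is why the 0.025 value is passed first.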


Example 2

A company manufactures 12 amp electrical fuses.

A random sample of 10 fuses was taken from a batch and the failure current, $x$, measured for each.
The results are summarised below:

$\sum x = 118.9 \qquad \sum x^2 = 1414.89$

Assume that the data can be regarded as a random sample from a normal population.
a Calculate an unbiased estimate for the variance of the batch based upon the sample.
b Use your estimate from part a to calculate a 95% confidence interval for the standard deviation.

a $s^2 = \frac{1}{9}\left(1414.89 - \frac{118.9^2}{10}\right) = 0.1299$ (4 d.p.)

b The percentage points are
$\chi^2_9(0.975) = 2.700$ and $\chi^2_9(0.025) = 19.023$
The confidence limits are

$$\frac{(n-1)s^2}{\chi^2_9(0.975)} = \frac{9 \times 0.1299}{2.700} = 0.433$$

$$\frac{(n-1)s^2}{\chi^2_9(0.025)} = \frac{9 \times 0.1299}{19.023} = 0.0615$$

The 95% confidence interval for the variance
is (0.062, 0.433).
Hence the 95% confidence interval for the
standard deviation is (0.249, 0.658).
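
When only summary statistics are given, the same working can be sketched as below. The helper name `unbiased_variance` is an assumption made for illustration, and the percentage points are again copied from the table.

```python
from math import sqrt


def unbiased_variance(n, sum_x, sum_x2):
    """Unbiased estimate of the variance from n, sum of x and sum of x squared."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


n = 10
s2 = unbiased_variance(n, 118.9, 1414.89)

# Percentage points for 9 degrees of freedom, read from the table
chi2_0975 = 2.700
chi2_0025 = 19.023

var_lower = (n - 1) * s2 / chi2_0025
var_upper = (n - 1) * s2 / chi2_0975

# Limits for the standard deviation are the square roots of the variance limits
sd_lower, sd_upper = sqrt(var_lower), sqrt(var_upper)

print(round(s2, 4))
print(round(var_lower, 4), round(var_upper, 3))
print(round(sd_lower, 3), round(sd_upper, 3))
```

Working with unrounded values gives a lower limit for the standard deviation of about 0.248. The worked solution takes the square root of the rounded variance limit 0.062, which gives 0.249.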
Exercise 6A

1 A random sample of 15 observations of a normal population gave an unbiased estimate for the
variance of the population of $s^2 = 4.8$. Calculate a 95% confidence interval for the population
variance.

2 A random sample of 20 observations of a normally distributed variable $X$ is summarised by
$\sum x = 132.4$ and $\sum x^2 = 884.3$. Calculate a 90% confidence interval for the variance of $X$.

3 A random sample of 14 observations was taken from a population that is assumed to be
normally distributed. The resulting values were:

Zidig Oy Didy 2ady 20, 25, 23, 39,24, 30, 2ly 2aly 32, 3A

Calculate a 95% confidence interval for the population variance.

4 A random sample of female voles was trapped in a wood. Their lengths, in centimetres
(excluding tails), were 7.5, 8.4, 10.1, 6.2 and 8.4.

Assuming that this is a sample from a normal distribution, calculate a 95% confidence
interval for the variance of the lengths of female voles. (4 marks)

5 A random sample of 10 is taken from the annual rainfall figures, $x$ cm, in a certain district.
The data can be summarised by $\sum x = 621$ and $\sum x^2 = 38\,938$.
a Calculate 90% confidence limits for the variance of the annual rainfall. (4 marks)
b What assumption have you made about the distribution of the annual rainfall in
part a? (1 mark)

6 A new variety of small daffodil is grown in the trial ground of a nursery. During the flowering
period, a random sample of 10 flowers was taken and the lengths, in millimetres, of their stalks
were measured. The results were as follows:

266, 254, 215, 220, 253, 230, 216, 248, 234, 244

Assuming that the lengths are normally distributed, calculate a 95% confidence interval
for the variance of the lengths. (4 marks)

(E/P) 7 A random sample of 8 cats was taken and the lengths of their tails, $x$ cm, were measured.

It was found that $\sum x = 234$.

Assuming that the lengths of cats' tails are normally distributed and given that the lower limit
for the 95% confidence interval for the population variance is 7.9623, find $\sum x^2$. (5 marks)

8 A hotel owner chooses 25 randomly selected dates and finds that the standard deviation of the
number of rooms occupied is 6.21.

Given that the number of rooms occupied is normally distributed, find, correct to
3 significant figures, a 90% confidence interval for the population standard deviation. (5 marks)

(E/P) 9 Francine works on a production line and is responsible for calibrating a machine that produces
aluminium fastenings. She selects a random sample of 8 fastenings and records the diameter, in
millimetres, of each one.

6.1, O08) 09, CA; 6.3, 6, 6.7, 6.1

a Calculate a 99% confidence interval for the standard deviation of the diameters of
the fastenings. (6 marks)

Given that the diameters of the fastenings are independent,
b state what further assumption is necessary for this confidence interval to be valid. (1 mark)

Francine requires the standard deviation of the diameters to be less than 0.15 mm, otherwise she
must recalibrate the machine.

c Use your answer to part a to decide if the machine needs recalibrating. (1 mark)


6.2 Hypothesis testing for the variance of a normal distribution


You need to be able to carry out a hypothesis test for the variance of a normal distribution. 


Suppose that a manufacturer of pistons for cars had a machine that finished the diameter of the
piston to a specific size. The machine was set up so that it produced the pistons with a diameter that
was normally distributed with mean 60 mm and variance 0.0009 mm². After the machine had been
running for some time, a sample of 15 pistons was taken and the mean of the size of the pistons in
the sample was still 60 mm, but the best estimate of the variance calculated from the sample was
0.002 mm². The manufacturer wishes to know whether the variance has increased.


Putting the manufacturer's question in the form of hypotheses you get

$H_0: \sigma^2 = 0.03^2 \qquad H_1: \sigma^2 > 0.03^2$

Note: The hypotheses are framed in terms of the population variance $\sigma^2$.

If $H_0$ is assumed true, then $\frac{(n-1)S^2}{\sigma^2}$ will be a single observation from a $\chi^2_{n-1}$ distribution.

Since in this case $s^2 > \sigma^2$, you need to consider how likely you would be to get the calculated
value of $\frac{(n-1)s^2}{\sigma^2}$ if $H_0$ were true. The critical value separating the acceptance and rejection regions
will be the relevant percentage point of the $\chi^2_{n-1}$ distribution.

In this case, $\nu = n - 1 = 15 - 1 = 14$ and we are using a 5% level of significance.

From the table on page 215, $\chi^2_{14}(0.05) = 23.685$ and the critical region is $\frac{(n-1)S^2}{\sigma^2} \geq 23.685$.

[Figure: sketch of the $\chi^2_{14}$ distribution with an upper tail of area 0.05 shaded.]

The value of the test statistic will be $\frac{(n-1)s^2}{\sigma^2}$.
In the manufacturer's case $\frac{(n-1)s^2}{\sigma^2} = \frac{14 \times 0.002}{0.03^2} = 31.11$

31.11 is in the critical region so the result is significant and $H_0$ is rejected. There is evidence that the
variance has increased.
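
As a quick check of the arithmetic for the piston example, the test statistic can be computed directly. This is a minimal sketch: the function name is illustrative, and the critical value 23.685 is the table value quoted in the text, not something the code derives.

```python
def chi2_variance_statistic(n, s2, sigma0_sq):
    """Test statistic (n - 1) s^2 / sigma^2, evaluated under H0."""
    return (n - 1) * s2 / sigma0_sq


# Piston example: n = 15, s^2 = 0.002, variance under H0 is 0.03^2
stat = chi2_variance_statistic(15, 0.002, 0.03 ** 2)

# Upper 5% point of the chi-squared distribution with 14 degrees of freedom
critical_value = 23.685

# One-tailed test for an increase: reject H0 if the statistic is in the upper tail
reject_h0 = stat >= critical_value

print(round(stat, 2), reject_h0)
```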
Note that there are seven steps to be followed:

1 Write down the null hypothesis ($H_0$).
2 Write down the alternative hypothesis ($H_1$).
3 Specify the significance level.
4 Write down the degrees of freedom $\nu$.
5 Write down the critical region.
6 Identify the population variance, $\sigma^2$, and the unbiased estimate, $s^2$, and calculate the value of the
test statistic $\frac{(n-1)s^2}{\sigma^2}$.
7 Complete your test and state your conclusions, stating whether $H_0$ is accepted or rejected, and
interpreting this in the context of the question.


Example 3

A random sample of 12 observations is taken from a normal distribution with a variance of $\sigma^2$.
The unbiased estimate of the population variance is calculated as 0.015.

Test, at the 5% level, the null hypothesis that $\sigma^2 = 0.025$ against the alternative hypothesis that
$\sigma^2 \neq 0.025$.

$H_0: \sigma^2 = 0.025 \qquad H_1: \sigma^2 \neq 0.025$

Watch out: The $\chi^2$ distribution is not symmetrical. For a two-tailed test, you will
have to use the table of values twice to find the critical values.

Significance level 5% (2.5% at each tail)
$\nu = 11$

From table:
$\chi^2_{11}(0.025) = 21.920$
$\chi^2_{11}(0.975) = 3.816$

The critical region is
$\frac{(n-1)S^2}{\sigma^2} > 21.920$ or $\frac{(n-1)S^2}{\sigma^2} < 3.816$

$\sigma^2 = 0.025 \qquad s^2 = 0.015$

Test statistic $\frac{(n-1)s^2}{\sigma^2} = \frac{(12 - 1) \times 0.015}{0.025} = 6.6$

$3.816 < 6.6 < 21.920$

6.6 is not in the critical region so there is insufficient evidence for rejecting $H_0$.
There is no evidence of a change in the variance.
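
A two-tailed version of the test can be sketched in the same way. The function name and argument names here are illustrative assumptions; both critical values must be looked up in the table separately, because the distribution is not symmetrical.

```python
def two_tailed_variance_test(n, s2, sigma0_sq, lower_critical, upper_critical):
    """Return the test statistic and whether it lies in the two-tailed critical region.

    lower_critical and upper_critical are table values supplied by the user,
    for example chi-squared_{n-1}(0.975) and chi-squared_{n-1}(0.025) for a 5% test.
    """
    stat = (n - 1) * s2 / sigma0_sq
    in_critical_region = stat < lower_critical or stat > upper_critical
    return stat, in_critical_region


# n = 12, s^2 = 0.015, H0: sigma^2 = 0.025, 11 degrees of freedom
stat, reject_h0 = two_tailed_variance_test(12, 0.015, 0.025, 3.816, 21.920)

print(round(stat, 1), reject_h0)
```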
Exercise 6B

1 Twenty random observations ($x$) are taken from a normal distribution with variance $\sigma^2$.
The results are summarised as follows:

$\sum x = 332.1 \qquad \sum x^2 = 5583.63$
a Calculate an unbiased estimate for the population variance.
b Test, at the 5% significance level, $H_0: \sigma^2 = 1.5$ against $H_1: \sigma^2 > 1.5$.

2 A random sample of 10 observations is taken from a normal distribution with variance $\sigma^2$.
The variance is thought to be equal to 0.09. The results were as follows:

0.35, 0.42, 0.30, 0.26, 0.31, 0.30, 0.40, 0.33, 0.30, 0.40
Test, at the 2.5% level of significance, $H_0: \sigma^2 = 0.09$ against $H_1: \sigma^2 < 0.09$.

3 The following random observations are taken from a normal distribution which is thought to
have a variance of 4.1:

2.1, 2.3, 3.5, 4.6, 5.0, 6.4, 7.1, 8.6, 8.7, 9.1
Test, at the 5% significance level, $H_0: \sigma^2 = 4.1$ against $H_1: \sigma^2 \neq 4.1$.

4 It is claimed that the masses of a particular component produced in a small factory are normally
distributed and have a mean mass of 10 g and a standard deviation of 1.12 g.

A random sample of 20 such components was found to have a variance of 1.15 g².

Test, at the 5% significance level, $H_0: \sigma^2 = 1.12^2$ against $H_1: \sigma^2 \neq 1.12^2$.

(E/P) 5 Rollers for use in roller bearings are produced on a certain machine. The rollers are supposed
to be normally distributed and to have a mean diameter ($\mu$) of 10 mm with a variance ($\sigma^2$) of
0.04 mm².

A random sample of 15 rollers is taken and their diameters, $x$ mm, were measured. The results
are summarised below:

$\sum x = 149.941 \qquad \sum x^2 = 1498.83$

a Calculate unbiased estimates for $\mu$ and $\sigma^2$. (3 marks)
b Test, at the 5% significance level, the hypothesis $\sigma^2 = 0.04$ against the hypothesis $\sigma^2 \neq 0.04$.
(5 marks)

6 The diameters of the eggs of the little gull are approximately normally distributed with mean
4.11 cm with a variance of 0.19 cm².
A sample of 8 little gull eggs from a particular island which were measured had diameters in
centimetres as follows:
4.4, 4.5, 4.1, 3.9, 4.4, 4.6, 4.5, 4.1

a Calculate an unbiased estimate for the variance of the population of little gull eggs
on the island. (2 marks)
b Test, at the 10% significance level, the hypothesis $\sigma^2 = 0.19$ against the hypothesis
$\sigma^2 \neq 0.19$. (5 marks)

(E) 7 Fishing line produced by a certain manufacturer is known to have a mean tensile breaking
strength of 170.2 kg and standard deviation 10.5 kg. The breaking strength of the fishing line is
normally distributed.

A new component is added to the material which will, it is claimed, decrease the standard
deviation without altering the tensile strength. A random sample of 20 pieces of the new fishing
line is selected. Each piece is tested to destruction and the tensile strength of each piece is noted.
The results are used to calculate unbiased estimates of the mean strength and standard deviation
of the population of new fishing line. These were found to be 172.4 kg and 8.5 kg.

a Test, at the 5% level, whether the variance has been reduced. (6 marks)
b What recommendation would you make to the manufacturer? (1 mark)

(E/P) 8 The masses of three-month-old Jack Russell puppies, $X$ kg, are normally distributed
with $X \sim N(\mu, \sigma^2)$.

A random sample of 10 puppies is taken and their masses at three months old, $x$ kg,
are recorded. The results are summarised as follows:

$\sum x = 32.12 \qquad \sum x^2 = 103.8592$

a Find an unbiased estimate for $\mu$, and calculate the standard error of your estimate. (3 marks)

b Stating your hypotheses clearly, test at the 5% level of significance whether the
standard deviation of the masses of puppies is different from 0.25 kg. (6 marks)


6.3 The F-distribution


Customers in supermarkets like to buy produce of a uniform size. This means that the manager of a 
supermarket is not only concerned with the mean size of, for example, apples, but also the variance. 


Given two suppliers, the one likely to be chosen is the one with the lower variance. 


Given also that the manager can only take a sample from each manufacturer, how can she tell 
whether one variance is larger than the other? 


The F-test is used to determine if two independent random samples from normal distributions have 
equal variance. The F-test will be covered in Section 6.4. The F-test is based on the F-distribution, 
which will be covered in this section. 


Begin by assuming that the samples are independent and from normal distributions. 


Suppose that you take a random sample of $n_x$ observations from an $N(\mu_x, \sigma_x^2)$ distribution and,
independently, a random sample of $n_y$ observations from an $N(\mu_y, \sigma_y^2)$ distribution. Unbiased
estimators for the two population variances are $S_x^2$ and $S_y^2$.


Using the result from Section 6.1,

$$\frac{(n_x - 1)S_x^2}{\sigma_x^2} \sim \chi^2_{n_x-1} \quad\text{and}\quad \frac{(n_y - 1)S_y^2}{\sigma_y^2} \sim \chi^2_{n_y-1}$$

It follows from this that

$$\frac{(n_x - 1)S_x^2/\sigma_x^2}{(n_y - 1)S_y^2/\sigma_y^2} \sim \frac{\chi^2_{n_x-1}}{\chi^2_{n_y-1}}$$

and so

$$\frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2} \sim \frac{\chi^2_{n_x-1}/(n_x - 1)}{\chi^2_{n_y-1}/(n_y - 1)}$$

The distribution $\frac{\chi^2_{n_x-1}/(n_x - 1)}{\chi^2_{n_y-1}/(n_y - 1)}$ has two parameters, $(n_x - 1)$ and $(n_y - 1)$, and is usually denoted by $F_{n_x-1,\,n_y-1}$
or by $F_{\nu_1,\nu_2}$ where $\nu_1 = (n_x - 1)$ and $\nu_2 = (n_y - 1)$,
i.e. if $n_x = 13$ and $n_y = 9$ the distribution would be
an $F_{12,8}$-distribution.


Note: Distributions of this type were first studied by Sir Ronald Fisher and hence are referred to as
F-distributions.


■ For a random sample of $n_x$ observations from an $N(\mu_x, \sigma_x^2)$ distribution and an independent
random sample of $n_y$ observations from an $N(\mu_y, \sigma_y^2)$ distribution,

$$\frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2} \sim F_{n_x-1,\,n_y-1}$$

If you assume that $\sigma_x^2 = \sigma_y^2$, then $\frac{\sigma_x^2}{\sigma_y^2} = 1$ and

$$\frac{S_x^2}{S_y^2} \sim F_{n_x-1,\,n_y-1}$$

■ If a random sample of $n_x$ observations
is taken from a normal distribution with
unknown variance $\sigma^2$ and an independent
random sample of $n_y$ observations is taken
from a normal distribution with equal but
unknown variance, then

$$\frac{S_x^2}{S_y^2} \sim F_{n_x-1,\,n_y-1}$$

Note: This distribution is given in the formulae booklet.

Watch out: $F_{n_x-1,\,n_y-1} \neq F_{n_y-1,\,n_x-1}$, so the order of
the parameters in the F-distribution is important.
If $S_x^2$ is on top of the fraction, then $n_x - 1$ comes
first, and if $S_y^2$ is on top of the fraction then $n_y - 1$
comes first.

The F-distribution has two parameters,
$\nu_1 = n_x - 1$ and $\nu_2 = n_y - 1$, and to get all
distributions relating to all possible combinations
of $\nu_1$ and $\nu_2$ would require very extensive tables.
However, the F-distribution is used mainly in
hypothesis testing for variances, and so you are not really interested in all values of $F_{\nu_1,\nu_2}$.

Notation: The numbers of degrees of freedom,
$\nu_1$ and $\nu_2$, are used here because that is how they
are described on the tables, but $\nu_x$ and $\nu_y$ could
equally well be used.




The values of $F_{\nu_1,\nu_2}$ that are of interest are the critical values, which are exceeded with probabilities of
5%, 1% etc. These critical values are written $F_{\nu_1,\nu_2}(0.05)$, $F_{\nu_1,\nu_2}(0.01)$, etc.

[Figure: sketch of the $F_{\nu_1,\nu_2}$ distribution with an upper tail of area 0.05 to the right of the critical value $F_{\nu_1,\nu_2}(0.05)$.]

Note: Some graphical calculators might also be able to give you critical values
for an F-distribution.

A separate table is given for each significance level (the table on page 218 is for the 1% (0.01) and
5% (0.05) significance levels). The first row at the top gives values of $\nu_1$ (remember that $\nu_1$ comes first
after the F), and the first column on the left gives values of $\nu_2$. Where row and column meet gives
the critical value corresponding to the significance level of that table. A short extract from the 0.05
significance level table is given below.


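Probability 0.05

ν2 \ ν1 | 1 | 2 | 3 | 4 | 5 | 6 | 8 | 10 | 12 | 24 | ∞
1 | 161.4 | 199.5 | 215.7 | 224.6 | 230.2 | 234.0 | 238.9 | 241.9 | 243.9 | 249.1 | 254.3
2 | 18.51 | 19.00 | 19.16 | 19.25 | 19.30 | 19.33 | 19.37 | 19.40 | 19.41 | 19.46 | 19.50
3 | 10.13 | 9.55 | 9.28 | 9.12 | 9.01 | 8.94 | 8.85 | 8.79 | 8.74 | 8.64 | 8.53
4 | 7.71 | 6.94 | 6.59 | 6.39 | 6.26 | 6.16 | 6.04 | 5.96 | 5.91 | 5.77 | 5.63
5 | 6.61 | 5.79 | 5.41 | 5.19 | 5.05 | 4.95 | 4.82 | 4.74 | 4.68 | 4.53 | 4.37
6 | 5.99 | 5.14 | 4.76 | 4.53 | 4.39 | 4.28 | 4.15 | 4.06 | 4.00 | 3.84 | 3.67
7 | 5.59 | 4.74 | 4.35 | 4.12 | 3.97 | 3.87 | 3.73 | 3.64 | 3.57 | 3.41 | 3.23
8 | 5.32 | 4.46 | 4.07 | 3.84 | 3.69 | 3.58 | 3.44 | 3.35 | 3.28 | 3.12 | 2.93
9 | 5.12 | 4.26 | 3.86 | 3.63 | 3.48 | 3.37 | 3.23 | 3.14 | 3.07 | 2.90 | 2.71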


Online: Explore the F-distribution using GeoGebra and use it to determine critical
values of the sample variances.

Example 4

Use the table to find:
a the $F_{5,8}(0.05)$ critical value    b the $F_{8,5}(0.05)$ critical value.

a $F_{5,8}(0.05)$ critical value = 3.69
b $F_{8,5}(0.05)$ critical value = 4.82
The F-distribution tables show you critical values which are exceeded with a given probability.

So, for example, if $f$ is an observation from an $F_{\nu_1,\nu_2}$ distribution, then $P(f > F_{\nu_1,\nu_2}(0.05)) = 0.05$.

Notation: A critical value $F_{\nu_1,\nu_2}(p)$ which is exceeded with probability $p$
is sometimes called an upper critical value for $p$. The value $F_{\nu_1,\nu_2}(1 - p)$
for which the observation is less than $F_{\nu_1,\nu_2}(1 - p)$ with probability $p$ is
sometimes called a lower critical value for $p$.

You are also interested in finding values at the other end of the distribution. For example, a value
$F_{\nu_1,\nu_2}(0.95)$ for which $P(f > F_{\nu_1,\nu_2}(0.95)) = 0.95$, or equivalently $P(f < F_{\nu_1,\nu_2}(0.95)) = 0.05$. This is shown
on the diagram below.

[Figure: sketch of the $F_{\nu_1,\nu_2}$ distribution with a lower tail of area 0.05 to the left of $F_{\nu_1,\nu_2}(0.95)$, the upper critical value for 0.95 or 5% lower critical value.]


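You can use the table to find lower critical values such as this as follows:

$$P(F_{\nu_1,\nu_2} < x) = P\left(\frac{1}{F_{\nu_1,\nu_2}} > \frac{1}{x}\right) = P\left(F_{\nu_2,\nu_1} > \frac{1}{x}\right)$$

The first step uses the fact that if $a$ and $b$ are both positive then $a < b \Rightarrow \frac{1}{a} > \frac{1}{b}$.

So the lower critical value for $p$ in an $F_{\nu_1,\nu_2}$ distribution is the same as the reciprocal of the upper
critical value for $p$ in an $F_{\nu_2,\nu_1}$ distribution. In other words,

■ $F_{\nu_1,\nu_2}(p) = \dfrac{1}{F_{\nu_2,\nu_1}(1 - p)}$

Watch out: The order of $\nu_1$ and $\nu_2$ has changed on the right-hand side.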
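
The reciprocal rule can be illustrated with a small lookup. This is a sketch under stated assumptions: the dictionary `UPPER_5_PERCENT` is a hypothetical container holding only two upper 5% critical values, $F_{10,8}(0.05) = 3.35$ and $F_{8,10}(0.05) = 3.07$, typed in from the tables.

```python
# Upper 5% critical values keyed by (nu1, nu2); only two entries are included here
UPPER_5_PERCENT = {
    (10, 8): 3.35,
    (8, 10): 3.07,
}


def lower_critical_value(nu1, nu2, upper_table):
    """Lower critical value of F(nu1, nu2), found as the reciprocal of the
    upper critical value of F(nu2, nu1). Note the swapped degrees of freedom."""
    return 1 / upper_table[(nu2, nu1)]


# F_{8,10}(0.95) uses the tabulated F_{10,8}(0.05)
print(round(lower_critical_value(8, 10, UPPER_5_PERCENT), 2))

# F_{10,8}(0.95) uses the tabulated F_{8,10}(0.05)
print(round(lower_critical_value(10, 8, UPPER_5_PERCENT), 2))
```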
Example 5

Find upper critical values for:

a $F_{8,10}(0.95)$    b $F_{10,8}(0.95)$

a $F_{8,10}(0.95)$ critical value $= \frac{1}{F_{10,8}(0.05)} = \frac{1}{3.35} = 0.2985\ldots = 0.30$ (2 d.p.)

b $F_{10,8}(0.95)$ critical value $= \frac{1}{F_{8,10}(0.05)} = \frac{1}{3.07} = 0.3257\ldots = 0.33$ (2 d.p.)


Example 6

Find the lower and upper 5% critical values for an $F_{a,b}$-distribution in each of the following cases:

a $a = 6$, $b = 10$    b $a = 12$, $b = 8$

a The upper critical value is $F_{6,10}(0.05) = 3.22$
The lower 5% critical value is $F_{6,10}(0.95) = \frac{1}{F_{10,6}(0.05)} = \frac{1}{4.06} = 0.25$ (2 d.p.)
b The upper critical value is $F_{12,8}(0.05) = 3.28$
The lower 5% critical value is $F_{12,8}(0.95) = \frac{1}{F_{8,12}(0.05)} = \frac{1}{2.85} = 0.35$ (2 d.p.)


Example 7

The random variable $X$ follows an F-distribution with 8 and 10 degrees of freedom.

Find $P\left(\frac{1}{5.81} < X < 5.06\right)$.

Looking at the upper tail, $P(X > 5.06) = P(F_{8,10} > 5.06)$
$P(X > 5.06) = 0.01$
$P(X < 5.06) = 1 - 0.01 = 0.99$

Looking at the other tail, $P\left(X < \frac{1}{5.81}\right) = P\left(F_{8,10} < \frac{1}{5.81}\right) = P(F_{10,8} > 5.81)$

so $P\left(X < \frac{1}{5.81}\right) = 0.01$

Hence $P\left(\frac{1}{5.81} < X < 5.06\right) = P(X < 5.06) - P\left(X < \frac{1}{5.81}\right)$
$= 0.99 - 0.01$
$= 0.98$




Example 8

The random variable $X$ follows an $F_{6,12}$-distribution.
Find $P(X < 0.25)$.

$P(X < 0.25) = P(F_{6,12} < 0.25) = P\left(F_{12,6} > \frac{1}{0.25}\right) = P(F_{12,6} > 4)$
From the tables, $F_{12,6}(0.05) = 4$
So $P(F_{12,6} > 4) = P(F_{6,12} < 0.25) = 0.05$


Exercise 6C

1 Find the upper 5% critical value for an $F_{a,b}$-distribution in each of the following cases:
a $a = 12$, $b = 18$    b $a = 4$, $b = 11$    c $a = 6$, $b = 9$


Find the lower 5% critical value for an $F_{a,b}$-distribution in each of the following cases:
a a=6,b=8 bd=25, b= 02 €.2=3,0=5 


Find the upper 1% critical value for an $F_{a,b}$-distribution in each of the following cases:
a a=12,b=18 b a=6,b= 16 ¢ 2=5,5=9 


Find the lower 1% critical value for an $F_{a,b}$-distribution in each of the following cases:
a a=3,b=12 b a=8,b=12 © 2=5,2= (2 


Find the lower and upper 5% critical values for an $F_{a,b}$-distribution in each of the following
cases:


a a=8,b=10 b a=12,b=10 © 723,025 


The random variable X follows an Fy \>-distribution. 
Find P(X < 0.5). 


The random variable X follows an F\, g-distribution. 


P 1 
Find Pe <X< 3.28} 


The random variable X has an F-distribution with 2 and 7 degrees of freedom. 
Find P(X < 9.55). 
(P) 9 The random variable $X$ follows an F-distribution with 6 and 12 degrees of freedom.

a Show that $P(0.25 < X < 3.00) = 0.9$.

A large number of values are randomly selected from an F-distribution with 6 and 12 degrees of
freedom.

b Find the probability that the seventh value to be selected will be the third value to lie between
0.25 and 3.00.


6.4 The F-test


From Section 6.3, you know that if a random sample of $n_x$ observations is taken from a normal
distribution with unknown variance $\sigma^2$ and an independent random sample of $n_y$ observations is
taken from a normal distribution with equal but unknown variance then

$$\frac{S_x^2}{S_y^2} \sim F_{n_x-1,\,n_y-1}$$

If $\sigma_x^2 = \sigma_y^2$, then you would expect $\frac{S_x^2}{S_y^2}$ to be close to 1, but if $\sigma_x^2 > \sigma_y^2$ then you would expect it to be
greater than 1.

You wish to test the null hypothesis $H_0: \sigma_x^2 = \sigma_y^2$ against the alternative hypothesis $H_1: \sigma_x^2 > \sigma_y^2$.

As before when testing hypotheses, if your value of $\frac{s_x^2}{s_y^2}$ is such that it could only occur under the null
hypothesis with a probability $< p$ (typically $p = 0.05$) then you reject the null hypothesis; otherwise
you have to conclude that there is insufficient evidence to reject the null hypothesis.


To test whether two variances are the same, a simple set of rules can be followed.
1 Find $s_l^2$ and $s_s^2$, the larger and smaller variances respectively.
2 Write down the null hypothesis $H_0: \sigma_l^2 = \sigma_s^2$.
3 Write down the alternative hypothesis $H_1: \sigma_l^2 > \sigma_s^2$ (one-tailed)
or $H_1: \sigma_l^2 \neq \sigma_s^2$ (two-tailed).
4 Look up the critical value of $F_{\nu_l,\nu_s}$, where $\nu_l$ is the number of degrees of freedom of the distribution
with the larger variance and $\nu_s$ is the number of degrees of freedom of the distribution with the
smaller variance. If a two-tailed test is used, $p$ is halved (e.g. for a 10% significance level you would
use $F_{\nu_l,\nu_s}(0.05)$ as the critical value).
5 Write down the critical region.
6 Calculate $F_{\text{test}} = \frac{s_l^2}{s_s^2}$.
7 See whether $F_{\text{test}}$ lies in the critical region and draw your conclusions. Relate these to the original
problem.
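
The rules above can be sketched as a short function. The names `f_test` and `UPPER_5_PERCENT` are illustrative assumptions, and the dictionary holds only two critical values copied from the 5% table, $F_{12,8}(0.05) = 3.28$ and $F_{10,6}(0.05) = 4.06$. For a two-tailed test at the 10% level the same 5% values would be used.

```python
# Upper 5% critical values keyed by (nu_l, nu_s); only two entries are included here
UPPER_5_PERCENT = {
    (12, 8): 3.28,
    (10, 6): 4.06,
}


def f_test(s2_a, n_a, s2_b, n_b, critical_values):
    """Variance-ratio test with the larger sample variance on top.

    Returns the test statistic, the critical value looked up in the supplied
    table, and whether the statistic lies in the critical region.
    """
    if s2_a >= s2_b:
        s2_l, nu_l, s2_s, nu_s = s2_a, n_a - 1, s2_b, n_b - 1
    else:
        s2_l, nu_l, s2_s, nu_s = s2_b, n_b - 1, s2_a, n_a - 1
    f_stat = s2_l / s2_s
    critical = critical_values[(nu_l, nu_s)]
    return f_stat, critical, f_stat > critical


# Samples of sizes 13 and 9 with variances 24 and 18
print(f_test(24, 13, 18, 9, UPPER_5_PERCENT))

# Samples of sizes 7 and 11 with variances 5 and 25
print(f_test(5, 7, 25, 11, UPPER_5_PERCENT))
```

Swapping the samples inside the function keeps the statistic at 1 or above, so only upper critical values are ever needed.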




Example 9

Two samples of sizes 13 and 9 are taken from normal distributions $X$ and $Y$ with variances $\sigma_x^2$ and
$\sigma_y^2$ respectively. The two samples give values $s_x^2 = 24$ and $s_y^2 = 18$. Test, at the 5% level, $H_0: \sigma_x^2 = \sigma_y^2$
against $H_1: \sigma_x^2 > \sigma_y^2$.

$\nu_l = 13 - 1 = 12$, $\nu_s = 9 - 1 = 8$
$s_l^2 = 24$ and $s_s^2 = 18$
The critical value is $F_{12,8}(0.05) = 3.28$

The test statistic is $F_{\text{test}} = \frac{s_l^2}{s_s^2} = \frac{24}{18} = 1.33$

$1.33 < 3.28$
There is insufficient evidence to reject $H_0$;
the two populations have equal variances.


Example 10

Two samples of sizes 7 and 11 are taken from normal distributions $X$ and $Y$ with variances $\sigma_x^2$ and
$\sigma_y^2$ respectively. The two samples give values $s_x^2 = 5$ and $s_y^2 = 25$. Test, at the 5% level of significance,
$H_0: \sigma_x^2 = \sigma_y^2$ against $H_1: \sigma_x^2 < \sigma_y^2$.

$\nu_l = 11 - 1 = 10$, $\nu_s = 7 - 1 = 6$
$s_l^2 = 25$ and $s_s^2 = 5$
The critical value is $F_{10,6}(0.05) = 4.06$

The test statistic is $F_{\text{test}} = \frac{s_l^2}{s_s^2} = \frac{25}{5} = 5$

$5 > 4.06$
There is sufficient evidence to reject $H_0$,
so $\sigma_x^2 < \sigma_y^2$.


Example 11

Two samples of sizes 11 and 13 are taken from normal distributions $X$ and $Y$ with variances $\sigma_x^2$
and $\sigma_y^2$ respectively. The two samples give values $s_x^2 = 1.6$ and $s_y^2 = 2.4$. Test, at the 10% level of
significance, $H_0: \sigma_x^2 = \sigma_y^2$ against $H_1: \sigma_x^2 \neq \sigma_y^2$.

Watch out: For a two-tailed F-test, there is still only one critical value. The only thing that
changes is that the p-value is halved.

$\nu_l = 13 - 1 = 12$, $\nu_s = 11 - 1 = 10$
$s_l^2 = 2.4$ and $s_s^2 = 1.6$
The critical value is $F_{12,10}(0.05) = 2.91$

The test statistic is $F_{\text{test}} = \frac{s_l^2}{s_s^2} = \frac{2.4}{1.6} = 1.5$

$1.5 < 2.91$
There is insufficient evidence to reject $H_0$;
the two variances are equal.


Example 12

A manufacturer of wooden furniture stores some of its wood outside and some inside a special
store. It is believed that the wood stored inside should have less variable hardness properties than
that stored outside. A random sample of 25 pieces of wood stored outside was taken and compared
to a random sample of 21 similar pieces taken from the inside store, with the following results:

 | Outside | Inside
Sample size | 25 | 21
Mean hardness (coded units) | 110 | 122
Sum of squares about the mean | 5190 | 3972

a Test, at the 0.05 level of significance, whether the manufacturer's belief is correct.

b State an assumption you made in order to do this test.

a $s^2_{\text{outside}} = \frac{5190}{24} = 216.25$ and $s^2_{\text{inside}} = \frac{3972}{20} = 198.6$

$\nu_o = 25 - 1 = 24$, $\nu_i = 21 - 1 = 20$
$H_0: \sigma_o^2 = \sigma_i^2 \qquad H_1: \sigma_o^2 > \sigma_i^2$
Critical value $= F_{24,20}(0.05) = 2.08$

$F_{\text{test}} = \frac{216.25}{198.6} = 1.089$

$1.089 < 2.08$, so there is insufficient evidence to reject $H_0$;
wood stored inside is just as variable in hardness as wood
stored outside.

b The assumption made is that the populations are normally
distributed.

Exercise 6D

1 Random samples are taken from two normally distributed populations. There are 11
observations from the first population and the best estimate for the population variance is
$s_1^2 = 7.6$. There are 7 observations from the second population and the best estimate for the
population variance is $s_2^2 = 6.4$.

Test, at the 5% level of significance, $H_0: \sigma_1^2 = \sigma_2^2$ against $H_1: \sigma_1^2 > \sigma_2^2$.

2 Random samples are taken from two normally distributed populations. There are 25
observations from the first population and the best estimate for the population variance is
$s_1^2 = 0.42$. There are 41 observations from the second population and the best estimate for the
population variance is $s_2^2 = 0.17$.

Test, at the 1% significance level, $H_0: \sigma_1^2 = \sigma_2^2$ against $H_1: \sigma_1^2 > \sigma_2^2$.

3 The sample variance of the lengths of a sample of 9 tent-poles produced by a machine was
63 mm². A second machine produced a sample of 13 tent-poles with a sample variance of
225 mm². Both these values are unbiased estimates of the population variances.

a Test, at the 10% level, whether there is evidence that the machines differ in variability,
stating the null and alternative hypotheses. (6 marks)
b State the assumption you have made about the distribution of the populations in order
to carry out the test in part a. (1 mark)

4 Random samples are taken from two normally distributed populations. The size of the sample
from the first population is $n_1 = 13$ and this gives an unbiased estimate for the population
variance $s_1^2 = 36.4$. The figures for the second population are $n_2 = 9$ and $s_2^2 = 52.6$.

Test, at the 5% significance level, whether $\sigma_1^2 = \sigma_2^2$ or if $\sigma_2^2 > \sigma_1^2$.

5 Dining Chairs Ltd are in the process of selecting a make of glue for using on the joints of their
furniture. There are two possible contenders: Goodstick, which is the more expensive, and
Holdtight, the cheaper of the two.

The company are concerned that, while both glues are said to have the same adhesive power, one
might be more variable than the other.

A series of trials are carried out with each glue and the joints tested to destruction. The force in
newtons at which each joint failed is recorded. The results are as follows:

Goodstick: 10.3, 8.2, 9.5, 9.9, 11.4

Holdtight: 9.6, 10.8, 9.9, 10.8, 10.0, 10.2

a Test, at the 10% significance level, whether the variances are equal. (6 marks)
b Which glue would you recommend and why? (1 mark)

6 The closing balances, £x, of a number of randomly chosen bank current accounts of two
different types, Chegrit and Dicabalk, are analysed by a statistician. The summary statistics are
given in the table below.

 | Sample size | $\sum x$ | $\sum x^2$
Chegrit | 7 | 276 | 143 742
Dicabalk | 15 | 394 | 102 341

Stating your hypotheses clearly, test, at the 10% significance level, whether the two distributions
have the same variance. (You may assume that the closing balances of each type of account are
normally distributed.) (6 marks)

7 Bigborough Council wants to change the bulbs in their traffic lights at regular intervals so that
there is a very small probability that any light bulb will fail in service.

The council wants the length of time between light bulb changes to be as long as possible.

They have obtained a sample of bulbs from another manufacturer, who claims the same bulb
life as their present manufacturer. The council wishes therefore to select the manufacturer whose
bulbs have the smallest variance.

When they last tested a random sample of 9 bulbs from their present supplier the summary
results were $\sum x = 9415$ hours, $\sum x^2 = 9\,863\,681$, where $x$ represents the lifetime of a bulb.

A random sample of 8 bulbs from the prospective new supplier gave the following bulb lifetimes
in hours:

1002, 1018, 943, 1030, 984, 963, 1048, 994
a Calculate unbiased estimates for the means and variances of the two populations. (2 marks)
Assuming that the lifetimes of bulbs are normally distributed,
b test, at the 10% significance level, whether the two variances are equal. (5 marks)

c State your recommendation to the council, giving reasons for your choice. (1 mark)


Mixed exercise Fé) 


1 


A random sample of 14 observations was taken of a random variable X which was normally 
distributed. The sample had mean x = 23.8 and variance s? = 1.8. 


Calculate: 
a a 95% confidence interval for the variance of the population 
b a 90% confidence interval for the variance of the population. 


A woollen mill produces scarves. The mill has several machines, each operated by a different 
person. Jane has recently started working at the mill and the supervisor wishes to check the 
lengths of the scarves Jane is producing. A random sample of 20 scarves is taken and the length, 
x cm, of each scarf is recorded. The results are summarised as: 

Σx = 1428    Σx² = 102 286
Assuming that the lengths of scarves produced by any individual follow a normal distribution, 
a calculate a 95% confidence interval for the variance σ² of the lengths of scarves produced by Jane. (5 marks) 
The mill’s owners require that 90% of scarves should be within 10 cm of the mean length. 
b Find the value of σ that would satisfy this condition. (3 marks) 
c Explain whether the supervisor should be concerned about the scarves Jane is producing. (1 mark) 
a Define a confidence interval. (1 mark) 


A car rental company owner chooses 20 randomly selected dates and finds that the standard 
deviation of the number of cars rented is 3.75. 


b Given that the number of cars rented is normally distributed, find, correct to 3 significant 
figures, a 90% confidence interval for the population standard deviation. (5 marks) 
The lengths of adult slow-worms, X cm, are normally distributed with X ~ N(μ, σ²). 


A random sample of 7 slow-worms is taken and their lengths, x cm, are recorded. The results are 
summarised as follows: 


Σx = 225.9    Σx² = 7338.07 


Stating your hypotheses clearly, test, at the 5% level of significance, whether the standard 
deviation of the lengths of slow-worms is different from 2.7 cm. (6 marks) 


Giovanna is a pharmacist and is testing the amount of time it takes for a particular dosage of a 
drug to take effect on patients. She selects a random sample of 10 patients and records the time, 
in minutes, for the drug to take effect on each patient. 


10, 12, 13, 15, 17, 17, 19, 21, 22, 25 
a Calculate a 95% confidence interval for the standard deviation of the time taken for the 
drug to take effect. (6 marks) 


Given that the times taken are independent, 
b state what further assumption is necessary for this confidence interval to be valid. (1 mark) 


Giovanna requires the standard deviation of the times taken to be less than 3.1 minutes 
otherwise she must change the dosage. 


c Use your answer to part a to decide if the dosage needs changing. (1 mark) 


The maximum weight that 50 cm lengths of a certain make of string can hold before breaking 
(the breaking strain) has a normal distribution with mean 40 kg and standard deviation 5 kg. 
The manufacturer of the string has developed a new process which should increase the mean 
breaking strain of the string but should not alter the standard deviation. Ten randomly selected 
pieces of string are tested and their breaking strains, in kg, are: 


51, 48, 37, 46, 36, 53, 34, 49, 47, 50 


Stating your hypotheses clearly, test, at the 5% level of significance, whether the new process has 
altered the variance. (6 marks) 


The random variable X has an F-distribution with 5 and 10 degrees of freedom. 
Find values of a and b such that P(a < X < b) = 0.90. 


The standard deviation of the length of a random sample of 8 fence posts produced by a timber 
yard was 8mm. A second timber yard produced a random sample of 13 fence posts with a 
standard deviation of 14mm. 


a Test, at the 10% significance level, whether there is evidence that the lengths 
of fence posts produced by these timber yards differ in variability. State your 
hypotheses clearly. (6 marks) 


b State an assumption you have made in order to carry out the test in part a. (1 mark) 
(E/P) 9 The lengths, x mm, of the forewings of a random sample of male and female adult butterflies are measured. The following statistics are obtained from the data. 

        | Number of butterflies | Sample mean x̄ | Σx²
Females | 7                     | 50.6           | 17 956.5
Males   | 10                    | 53.2           | 28 335.1


Assuming the lengths of the forewings are normally distributed, test, at the 10% level 
of significance, whether the variances of the two distributions are the same. 


State your hypotheses clearly. (6 marks) 
Challenge

Hint: $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$

a Given that $\mathrm{Var}(\chi^2_\nu) = 2\nu$, show that $\mathrm{Var}(S^2) = \frac{2\sigma^4}{n-1}$, where $S^2$ is the sample variance and $\sigma^2$ is the population variance. 

b Hence comment on the quality of $S^2$ as an estimator for $\sigma^2$ as the sample size increases. 


Summary of key points 


1 If a random sample of n observations X₁, X₂, ..., Xₙ is selected from N(μ, σ²), then 
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$

2 Generally, for a probability of α that the variance falls outside the limits, 

• the 100(1 − α)% confidence limits are $\frac{(n-1)s^2}{\chi^2_{n-1}(\alpha/2)}$ and $\frac{(n-1)s^2}{\chi^2_{n-1}(1-\alpha/2)}$

• the 100(1 − α)% confidence interval for the variance of a normal distribution is 
$$\left(\frac{(n-1)s^2}{\chi^2_{n-1}(\alpha/2)},\ \frac{(n-1)s^2}{\chi^2_{n-1}(1-\alpha/2)}\right)$$

3 To carry out a hypothesis test for the variance of a normal distribution, follow these steps. 

• Write down the null hypothesis (H₀). 
• Write down the alternative hypothesis (H₁). 
• Specify the significance level. 
• Write down the degrees of freedom ν. 
• Write down the critical region. 
• Identify the population variance, σ², and the unbiased estimate, s², and calculate the value of the test statistic $\frac{(n-1)s^2}{\sigma^2}$
• Complete your test and state your conclusions, stating whether H₀ is accepted or rejected, and interpreting this in the context of the question. 
The F-distribution has two parameters, $\nu_1 = n_x - 1$ and $\nu_2 = n_y - 1$, and is usually denoted by $F_{\nu_1,\nu_2}$. 
For a random sample of $n_x$ observations from an $N(\mu_x, \sigma_x^2)$ distribution and an independent random sample of $n_y$ observations from an $N(\mu_y, \sigma_y^2)$ distribution, 
$$\frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2} \sim F_{n_x-1,\,n_y-1}$$

If a random sample of $n_x$ observations is taken from a normal distribution with unknown variance $\sigma^2$ and an independent random sample of $n_y$ observations is taken from a normal distribution with equal but unknown variance, then 
$$\frac{S_x^2}{S_y^2} \sim F_{n_x-1,\,n_y-1}$$

A critical value $F_{\nu_1,\nu_2}(p)$ which is exceeded with probability p is called an upper critical value for p. The value $F_{\nu_1,\nu_2}(1-p)$ for which the observation is less than $F_{\nu_1,\nu_2}(1-p)$ with probability p is called a lower critical value for p. 
$$F_{\nu_1,\nu_2}(1-p) = \frac{1}{F_{\nu_2,\nu_1}(p)}$$

To test whether two variances are the same, a simple set of rules can be followed: 

• Find $s_l^2$ and $s_s^2$, the larger and smaller variances respectively. 
• Write down the null hypothesis H₀: $\sigma_l^2 = \sigma_s^2$
• Write down the alternative hypothesis H₁: $\sigma_l^2 > \sigma_s^2$ (one-tailed), or H₁: $\sigma_l^2 \ne \sigma_s^2$ (two-tailed). 
• Look up the critical value of $F_{\nu_l,\nu_s}$, where $\nu_l$ is the number of degrees of freedom of the distribution with the larger variance and $\nu_s$ is the number of degrees of freedom of the distribution with the smaller variance. If a two-tailed test is used, p is halved (e.g. for a 10% significance level you would use $F_{\nu_l,\nu_s}(5\%)$ as the critical value). 
• Write down the critical region. 
• Calculate $F_{\text{test}} = \frac{s_l^2}{s_s^2}$
• See whether $F_{\text{test}}$ lies in the critical region and draw your conclusions. Relate these to the original problem. 


After completing this chapter you should be able to: 

● Find a confidence interval for the mean of a normal distribution with unknown variance 
● Conduct a hypothesis test for the mean of a normal distribution with unknown variance 
● Carry out a paired t-test 
● Find a confidence interval for the difference between means from two independent normal distributions with equal but unknown variances 
● Conduct a hypothesis test for the difference between means from two independent normal distributions with equal but unknown variances 


Farmers often try out different diets to see which is the most effective at producing high-yield animals. They can compare the effectiveness of two diets using a paired t-test. → Mixed exercise Q13 


1 A random sample of size 20 is taken from a normally distributed population with a standard deviation of 2. The mean of the sample was 16. 
Find a 95% confidence interval for the mean μ. ← Section 5.2 

2 A researcher is comparing the heights of children in two towns. A random sample of 100 children from town A is taken and the sample mean and standard deviation are 145 cm and 4 cm respectively. An independent random sample of 120 children from town B is taken and the sample mean and standard deviation are 146 cm and 3.5 cm respectively. 
Test, at the 5% level of significance, whether there is evidence of a difference in the mean heights of the children in the two towns. ← Section 5.3 
7.1 Mean of a normal distribution with unknown variance 


You know that for a normally distributed random variable X, the sample mean will also be distributed normally: 
$$\bar X \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$$
This means that you can calculate a confidence interval for the population mean, μ, as long as you know the population variance. In most cases, however, if you are taking a sample from a large population you will not already know the population variance. 

If the sample size, n, is large, then you can use the sample variance as an approximation of the population variance. 

Links: For large n, $\frac{\bar X - \mu}{S/\sqrt n}$, for a sample from any population, is approximately normal with distribution N(0, 1²). ← Section 5.4 

However, if n is small, S is unlikely to be very close to σ and $\frac{\bar X - \mu}{S/\sqrt n}$ can no longer be modelled by the normal distribution N(0, 1²). 
When n is small we usually use the symbol t to denote the quantity $\frac{\bar X - \mu}{S/\sqrt n}$. 

■ If a random sample X₁, X₂, ..., Xₙ is selected from a normal distribution with mean μ and unknown variance σ² then 
$$t = \frac{\bar X - \mu}{S/\sqrt n}$$
has a $t_{n-1}$-distribution where S² is an unbiased estimator of σ². 

Links: $S^2 = \frac{1}{n-1}\left(\sum X_i^2 - n\bar X^2\right)$ ← Section 5.1 


There is a family of t-distributions determined by the value of n. This establishes the number of degrees of freedom, similar to the chi-squared and F-distributions. 

The number of degrees of freedom, ν, is equal to n − 1 and as ν → ∞, the t-distribution approaches the distribution N(0, 1²). For this reason the t-distribution is usually used when the sample size, n, is small. For larger sample sizes it is more convenient to approximate t with a normal distribution. The diagram below shows two examples of the t-distribution for different values of ν, together with the standardised normal distribution. 


[Figure: two t-distribution curves for different values of ν, plotted with the standard normal curve; horizontal axis from −4 to 4.] 

Note: W. S. Gosset, who published his works under the pseudonym ‘Student’, first investigated the probability distribution of $\frac{\bar X - \mu}{S/\sqrt n}$ for a sample taken from a normal distribution. The resulting distribution is known as ‘Student's t-distribution’, or more commonly just the t-distribution. 
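The claim that the t-distribution approaches N(0, 1²) as ν → ∞ can be checked numerically. The sketch below is an illustration only, not part of the course method: it assumes the standard closed form of the t density (which this chapter does not derive), and the function names `t_pdf` and `normal_pdf` are hypothetical.

```python
import math

def t_pdf(x, nu):
    """Density of Student's t-distribution with nu degrees of freedom
    (standard closed form, assumed here rather than derived)."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log(1 + x * x / nu))

def normal_pdf(x):
    """Density of the standard normal distribution N(0, 1^2)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Compare the centre (x = 0) and a tail point (x = 3) for increasing nu.
for nu in (1, 5, 30, 1000):
    print(nu, round(t_pdf(0, nu), 4), round(t_pdf(3, nu), 4))
print("normal", round(normal_pdf(0), 4), round(normal_pdf(3), 4))
```

For small ν the curve is lower at the centre and heavier in the tails than the normal curve, which is why the t critical values are larger than the corresponding normal ones.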
Confidence intervals and tests using the t-distribution 


As with the F-distribution and the χ² distribution, the critical values of the t-distribution depend on the number of degrees of freedom. The table of values in the formulae booklet, and on page 217, gives percentage points for the t-distribution for certain values of ν up to 120. 

Note: The t-distribution is symmetric in the same way as the normal distribution. 

The values in the table are those which a random variable with Student's t-distribution on ν degrees of freedom exceeds with the probability shown. 


ν | 0.10  | 0.05  | 0.025  | 0.01   | 0.005
1 | 3.078 | 6.314 | 12.706 | 31.821 | 63.657
2 | 1.886 | 2.920 | 4.303  | 6.965  | 9.925
3 | 1.638 | 2.353 | 3.182  | 4.541  | 5.841
4 | 1.533 | 2.132 | 2.776  | 3.747  | 4.604
5 | 1.476 | 2.015 | 2.571  | 3.365  | 4.032
6 | 1.440 | 1.943 | 2.447  | 3.143  | 3.707
7 | 1.415 | 1.895 | 2.365  | 2.998  | 3.499
8 | 1.397 | 1.860 | 2.306  | 2.896  | 3.355

For example, if X has the t₇-distribution, with 7 degrees of freedom (n = 8): 
• P(X > 1.895) = 0.05 
• P(X < 1.895) = 0.95 
and by the symmetry of the t-distribution: 
• P(X < −1.895) = 0.05 

Note: You may be able to use your calculator to find the percentage points for the t-distribution rather than the tables. 

Watch out: When working with the t-distribution you are strongly advised to draw appropriate diagrams so that you are sure in your own mind which areas under the t-distribution you are dealing with. 

[Figure: t-distribution curve with the upper tail beyond 1.895 marked.] 

Online: Explore the t-distribution using GeoGebra and use it to determine critical values. 


The random variable X has a t-distribution with 10 degrees of freedom. Determine values of t for which: 

a P(X > t) = 0.025 
b P(X < t) = 0.95 
c P(X < t) = 0.025 
d P(|X| > t) = 0.05 
e P(|X| < t) = 0.98 

Notation: |X| means the modulus of X. This is the absolute value of X ignoring the sign, for example the modulus of −5, written |−5|, is 5. So 
P(|X| > t) = P(X < −t) + P(X > t) 
P(|X| < t) = P(−t < X < t) 
a t₁₀(0.025) = 2.228 
(You are looking for P(X > t) so you can use the tables directly. Look where the ν = 10 row intersects with the 0.025 column.) 

b If P(X < t) = 0.95 then P(X > t) = 1 − 0.95 = 0.05 
From the table, t₁₀(0.05) = 1.812 
(The whole area under the curve = 1, so P(X > t) = 1 − P(X < t). Look where the ν = 10 row intersects with the 0.05 column.) 

c From a, P(X > t) = 0.025 when t = 2.228, so P(X < t) = 0.025 if t = −2.228 
(Because the distribution is symmetrical, P(X < −t) = P(X > t). You know from part a that P(X > t) = 0.025 if t = 2.228, so P(X < t) = 0.025 when t = −2.228.) 

d From a and c, P(|X| > t) = 0.05 if X < −2.228 or X > 2.228, so t = 2.228; the two tails are bounded by −2.228 and 2.228. 
(This is two-tailed with probability of 0.025 at each tail.) 

e P(|X| > t) = 0.02 if t = 2.764, so P(|X| < t) = 0.98 when −2.764 < X < 2.764, giving t = 2.764; the central region is bounded by −2.764 and 2.764. 
(Again, a two-tailed problem. From the diagram you can see that you are looking for tails each with probability 0.01.) 



Example 

The random variable Y has a t₄-distribution. 
Determine: 
a P(Y > 3.747)    b P(Y < −2.132) 

a P(Y > 3.747) = 0.01 
(From the ν = 4 row of the table you can see that 3.747 is in the 0.01 probability column.) 

b P(Y > 2.132) = 0.05 
By symmetry, P(Y < −2.132) = 0.05 
(From the ν = 4 row of the table you can see that 2.132 is in the 0.05 probability column.) 


You can use the t-distribution to find a confidence interval for the mean of a normal distribution when the variance is unknown. 

For a sample taken from a normal population with unknown variance, 
$$t = \frac{\bar X - \mu}{S/\sqrt n}$$
has a $t_{n-1}$-distribution. 

If you want to find a 95% confidence interval for the population mean, μ, then you start by finding a value of t such that 
$$P\!\left(\frac{\bar X - \mu}{S/\sqrt n} > t\right) = 0.025 \quad\text{and}\quad P\!\left(\frac{\bar X - \mu}{S/\sqrt n} < -t\right) = 0.025$$
This value of t is called the $t_{n-1}$ value for a probability of 0.025. You can find it using tables or your calculator. 

Notation: This value is written $t_{n-1}(0.025)$. 

Thus 
$$P\!\left(-t_{n-1}(0.025) < \frac{\bar X - \mu}{S/\sqrt n} < t_{n-1}(0.025)\right) = (1 - 0.025) - 0.025 = 0.95$$

Look at the inequality inside the bracket. You are interested in μ, so here you try to isolate it by: 
1 multiplying by $\frac{S}{\sqrt n}$: 
$$-t_{n-1}(0.025) \times \frac{S}{\sqrt n} < \bar X - \mu < t_{n-1}(0.025) \times \frac{S}{\sqrt n}$$
2 subtracting $\bar X$: 
$$-t_{n-1}(0.025) \times \frac{S}{\sqrt n} - \bar X < -\mu < t_{n-1}(0.025) \times \frac{S}{\sqrt n} - \bar X$$
3 multiplying by −1 and altering the inequality: 
$$\bar X - t_{n-1}(0.025) \times \frac{S}{\sqrt n} < \mu < \bar X + t_{n-1}(0.025) \times \frac{S}{\sqrt n}$$

For a particular sample with mean $\bar x$ and variance $s^2$, this becomes: 
$$P\!\left(\bar x - t_{n-1}(0.025) \times \frac{s}{\sqrt n} < \mu < \bar x + t_{n-1}(0.025) \times \frac{s}{\sqrt n}\right) = 0.95$$

So for a small sample of size n from a normal distribution N(μ, σ²) with unknown mean and variance: 

• the 95% confidence limits for μ are given by $\bar x \pm t_{n-1}(0.025) \times \frac{s}{\sqrt n}$

• the 95% confidence interval for μ is given by 
$$\left(\bar x - t_{n-1}(0.025) \times \frac{s}{\sqrt n},\ \bar x + t_{n-1}(0.025) \times \frac{s}{\sqrt n}\right)$$
In the same way, 

• the 90% confidence limits for μ are given by $\bar x \pm t_{n-1}(0.05) \times \frac{s}{\sqrt n}$
• the 90% confidence interval for μ is given by 
$$\left(\bar x - t_{n-1}(0.05) \times \frac{s}{\sqrt n},\ \bar x + t_{n-1}(0.05) \times \frac{s}{\sqrt n}\right)$$

■ In general, for a small sample of size n from a normal distribution N(μ, σ²) with unknown mean and variance: 
• the 100(1 − α)% confidence limits for the population mean are $\bar x \pm t_{n-1}\!\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt n}$
• the 100(1 − α)% confidence interval for the population mean is 
$$\left(\bar x - t_{n-1}\!\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt n},\ \bar x + t_{n-1}\!\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt n}\right)$$
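The general formula can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the critical value is read from the t-table and passed in by hand, the function name `t_confidence_interval` is hypothetical, and the inputs are the summary values quoted in the two worked examples of this section.

```python
import math

def t_confidence_interval(xbar, s, n, t_crit):
    """Return (lower, upper) limits xbar -/+ t_crit * s / sqrt(n).

    t_crit is t_{n-1}(alpha/2), looked up in the table by the caller.
    """
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# 90% interval: xbar = 25.5, s^2 = 0.8560, n = 6, t_5(0.05) = 2.015
low90, high90 = t_confidence_interval(25.5, math.sqrt(0.8560), 6, 2.015)
print(round(low90, 3), round(high90, 3))

# 95% interval: xbar = 20.675, s = 1.513, n = 12, t_11(0.025) = 2.201
low95, high95 = t_confidence_interval(20.675, 1.513, 12, 2.201)
print(round(low95, 3), round(high95, 3))
```

Passing the critical value in explicitly keeps the table lookup visible, which mirrors how the calculation is set out by hand.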


Example 

A sample of 6 trout taken from a fish farm were caught and their lengths in centimetres were measured. The lengths of the fish were as follows: 

26.8 26.0 25.8 25.5 24.3 24.6 

Assuming that the lengths of trout are normally distributed, find a 90% confidence interval for the mean length of trout in the fish farm. 

Using a calculator gives x̄ = 25.5 and s² = 0.8560 
(First find the sample mean and variance.) 
s = √0.8560 = 0.9252 
(The standard deviation of the sample can be found by taking the square root of the variance.) 

The 90% confidence limits for μ are 
$$\bar x \pm t_5(5\%) \times \frac{s}{\sqrt n} = 25.5 \pm 2.015 \times \frac{0.9252}{\sqrt 6} = 25.5 \pm 0.761$$
(Put your values for x̄ and s into the formula, and work out the confidence interval.) 

The 90% confidence interval is (24.739, 26.261) 


Example 

The percentage starch content of potatoes is normally distributed with mean μ. In order to assess the mean value of the starch content, a random sample of 12 potatoes is selected and their starch content measured. The percentages of starch contents obtained were as follows: 

23.2 20.3 18.6 20.0 20.8 21.6 19.4 18.7 22.1 19.5 21.3 22.6 

Find a 95% confidence interval for the mean. 

x̄ = 20.675 and s = 1.513 
(You could use a calculator to find these.) 
The 95% confidence limits for μ are 
$$\bar x \pm t_{11}(2.5\%) \times \frac{s}{\sqrt n} = 20.675 \pm 2.201 \times \frac{1.513}{\sqrt{12}} = 20.675 \pm 0.961$$
(Use the formula.) 
The 95% confidence interval is (19.714, 21.636) 
(Write out the confidence interval.) 
Exercise 7A 


1 


Given that the random variable X has a f,,-distribution, find values of ¢ such that: 


a P(X <1) =0.025 b PX s7=0,05 ce P(LXl> 2) =0.95 
Find: 
a t(0.01) b ty6(0.05) 


The random variable Y has a tₙ-distribution. Find a value (or values) of t that satisfies each of the following. 

a n = 10, P(Y < t) = 0.95    b n = 32, P(Y < t) = 0.005    c n = 5, P(Y < t) = 0.025 
d n = 16, P(|Y| < t) = 0.98    e n = 18, P(|Y| > t) = 0.10 


A test on the life (in hours) of a certain make of torch batteries gave the following results. 
20.3 17.3 25.0 18.4 16.3 24.8 24.3 21.2 


Assuming that the lifetime of batteries is normally distributed, find a 90% confidence interval 
for the mean. 


A sample of size 16 taken from a normal population with unknown variance gave the following 
sample values. 


x=124 e=21)) 


Find a 95% confidence interval for the population mean. 


The mean heights (measured in centimetres) of six male students at a college were as follows: 
182 178 183 180 169 184 


Calculate: 

a a 90% confidence interval b a 95% confidence interval 
for the mean height of male students at the college. 

You may assume that the heights are normally distributed. 


The masses (in grams) of 10 nails selected at random from a bin of 90 mm long nails were: 

oF 10.2 11.2 9.4 11.0 11.2 9.8 9.8 10.0 11.3 
a Calculate a 98% confidence interval for the mean mass of the nails in the bin. (6 marks) 
b State one assumption you have made in your calculation. (1 mark) 


A random sample of the feet of 8 adult males gave the following summary 
statistics of length x (in cm): 
Σx = 224.1    Σx² = 6337.39 


Assuming that the length of men’s feet is normally distributed, calculate a 99% 
confidence interval for the mean length of men’s feet based upon these results. (6 marks) 
(E) 9 A random sample of 26 students from the sixth form of a school sat an intelligence test that measured their IQs. The results are summarised below. 

x̄ = 122    s² = 225 


Assuming that IQ is normally distributed, calculate a 95% confidence interval for the mean 
IQ of the students. (6 marks) 


(E/P) 10 Add ticks to this table to show the distribution you would use when finding a confidence interval. 

Confidence interval | Normal | χ² | t
For the population mean, using a sample of size 50 from a population of unknown variance | | |
For the population mean, using a sample of size 6 from a population of known variance | | |
For the population variance, using a sample of size 20 | | |

(3 marks) 


(E/P) 11 A company manufactures light bulbs which they state have an average lifespan of 500 days. 
The manager is concerned that the production process is faulty and that the light bulbs do not 
last as long as stated. 

He tests a random sample of 15 bulbs and finds their lifespan, x days. The data is summarised as 


Σx = 7338    Σx² = 3 618 260 
a Explain why you need to use the ¢-distribution to find a confidence interval for the 
population mean. (1 mark) 
b Find a 90% confidence interval for the mean lifespan, y days, of the light bulbs. (6 marks) 
c State one assumption you have made in finding your answer to part b. (1 mark) 
d Find a 95% confidence interval for the population variance. (4 marks) 


Hint: Confidence intervals for the population variance are covered in Chapter 6. ← Section 6.1 


7.2 Hypothesis test for the mean of a normal distribution with unknown variance 


Apart from using the ¢-distribution rather than the normal distribution for finding the critical region, 
testing the mean of a normal distribution with unknown variance follows the same steps as you used 
when testing the mean of a normal distribution with known variance. 


The following steps might help you in answering questions about hypothesis testing of the mean of a 
normal distribution with unknown variance. 


1 Write down H₀. 
2 Write down H₁. 
3 Specify the significance level, α. 
4 Write down the number of degrees of freedom, ν. 
5 Write down the critical region. 
6 Calculate x̄, s² and t using 
$$\bar x = \frac{\sum x}{n}, \qquad s^2 = \frac{\sum x^2 - n\bar x^2}{n-1} \ \left(\text{or } s^2 = \frac{\sum (x - \bar x)^2}{n-1}\right) \qquad\text{and}\qquad t = \frac{\bar x - \mu}{s/\sqrt n}$$
7 Conclusions 
The following points should be addressed: 
i Is the result significant or not? 
ii What are the implications in terms of the original problem? 


A shopkeeper sells jars of jam. The weights of the jars of jam are normally distributed with a mean of 150 g. A customer complains that the mean weight of 8 jars she had bought was only 147 g. 
An estimate for the standard deviation of the weights of the 8 jars of jam calculated from the 8 observations was 2 g. 

a Test, at the 5% significance level, whether 147 g is significantly less than the quoted mean. 
b Discuss whether the customer has cause for complaint. 

a H₀: μ = 150    H₁: μ < 150 
Significance level = 0.05 (one-tailed test) 
(State your hypotheses and write down the significance level.) 

ν = 8 − 1 = 7 
(Find the number of degrees of freedom.) 

From tables, the critical value t₇(5%) is −1.895 
(Look up the critical value in the table on page 217. Note a minus sign is needed since a left-hand tail is being used.) 
so the critical region is t ≤ −1.895 
(Write down the critical region.) 

x̄ = 147, μ = 150, s = 2 
$$t = \frac{\bar x - \mu}{s/\sqrt n} = \frac{147 - 150}{2/\sqrt 8} = -4.2426$$
(Calculate x̄ and s. Use these to calculate t.) 

Now −4.2426 < −1.895 so the result is significant and H₀ is rejected. 
(Draw a conclusion.) 

b There is evidence to suggest that the mean weight is less than 150 g and the customer does have a cause for complaint. 
(Put it in the context of the original problem.) 
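The arithmetic of the jam example can be reproduced with a short sketch. It assumes the critical value −1.895 is taken from the table as in the worked solution; the function name `t_statistic` is hypothetical.

```python
import math

def t_statistic(xbar, mu0, s, n):
    """Test statistic t = (xbar - mu0) / (s / sqrt(n)) on n - 1 degrees of freedom."""
    return (xbar - mu0) / (s / math.sqrt(n))

t = t_statistic(xbar=147, mu0=150, s=2, n=8)
critical = -1.895          # -t_7(0.05), lower tail, read from the table
reject_h0 = t <= critical  # one-tailed test in the left-hand tail

print(round(t, 4), reject_h0)
```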
Example 

The temperature (°C) was measured at noon on 10 days during the month of March in West Cumbria. The readings were: 

12.8 11.4 12.9 15.1 15.4 13.5 14.9 15.0 16.0 15.8 

Using a 5% significance level, test whether or not this is an increase over the previous year when the average noon temperature was 13.5 °C. 

H₀: μ = 13.5    H₁: μ > 13.5 
Significance level 5% 
(State your hypotheses and write down the significance level.) 

ν = 10 − 1 = 9 
From tables, the critical value is t₉(5%) = 1.833 
so the critical region is t ≥ 1.833 
(Write down the critical region.) 

$$\bar x = \frac{12.8 + 11.4 + 12.9 + 15.1 + 15.4 + 13.5 + 14.9 + 15.0 + 16.0 + 15.8}{10} = 14.28$$
$$s^2 = \frac{\sum x^2 - n\bar x^2}{n-1} = \frac{2060.28 - 10 \times 14.28^2}{10 - 1} = 2.344$$
s = 1.531 
(Calculate x̄ and s. Note both of these can be found using a calculator.) 

$$t = \frac{\bar x - \mu}{s/\sqrt n} = \frac{14.28 - 13.5}{1.531/\sqrt{10}} = 1.611$$
(Calculate t.) 

1.611 < 1.833, so the result is not significant. 
There is not enough evidence to suggest that the average temperature has increased. 
(Draw a conclusion.) 
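When only Σx and Σx² are available, the same test can be run from those sums. The sketch below is illustrative: it uses Σx = 142.8 (that is, 10 × 14.28) and Σx² = 2060.28 from the working above, and the helper name `summary_from_sums` is hypothetical.

```python
import math

def summary_from_sums(sum_x, sum_x2, n):
    """Return (sample mean, unbiased variance estimate) from sum x and sum x^2."""
    xbar = sum_x / n
    s2 = (sum_x2 - n * xbar ** 2) / (n - 1)
    return xbar, s2

n = 10
xbar, s2 = summary_from_sums(142.8, 2060.28, n)
t = (xbar - 13.5) / math.sqrt(s2 / n)

critical = 1.833              # t_9(0.05), upper tail, read from the table
significant = t >= critical

print(round(xbar, 2), round(s2, 3), round(t, 3), significant)
```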


A concrete manufacturer tests cubes of its concrete at regular intervals, and their compressive strengths in N mm⁻² are determined. The mean value of the strengths is required to be 0.47 N mm⁻². A new supplier of cement offers to supply the firm at a cheaper rate than the present supplier, and a trial bag of cement is used to make 12 concrete cubes. Upon testing, these cubes are found to have strengths (x) such that Σx = 5.52 and Σx² = 2.542. Assume that the strengths are normally distributed. 

a Stating your hypotheses clearly, test, at the 5% level of significance, whether or not the use of the new cement has altered the mean strength of the concrete. 

b In the light of your conclusion to the test in part a, what would you recommend the manufacturer to do? 
a H₀: μ = 0.47    H₁: μ ≠ 0.47 
(You are looking to see if the strength has altered up or down so use ≠ in H₁.) 

ν = 12 − 1 = 11 
Probability in each tail = 0.025 
(This is a two-tailed test so halve the significance level to find the probability in each tail.) 
From tables the critical value is 2.201 
The critical region is |t| ≥ 2.201 

$$\bar x = \frac{\sum x}{n} = \frac{5.52}{12} = 0.46$$
$$s^2 = \frac{\sum x^2 - n\bar x^2}{n-1} = \frac{2.542 - 12 \times 0.46^2}{11} = 0.0002545$$
s = 0.016 
(Since you are given Σx and Σx² in the question, use these formulae.) 

$$t = \frac{\bar x - \mu}{s/\sqrt n} = \frac{0.46 - 0.47}{0.016/\sqrt{12}} = -2.165$$
(Since x̄ < μ, t is negative.) 

Now |−2.165| < |−2.201| 
(t lies between −2.201 and 2.201.) 
The result is not significant. There is not enough evidence to suggest that the mean strength has altered. 
(Draw the conclusion in context.) 

b Since the mean strength has not altered, the manufacturer should accept the new supplier because they are cheaper. The two values −2.165 and −2.201 are quite close, however, and a one-tailed test of whether or not the strength had decreased should be done, or failing this a further sample could be taken. 
(Base your recommendation on your conclusion.) 
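Because the test statistic is so close to the critical value, it is worth seeing how much the rounding of s matters. The sketch below is an illustration under stated assumptions: it recomputes t once with s rounded to 0.016, as in the working above, and once without rounding; the variable names are hypothetical.

```python
import math

n, sum_x, sum_x2, mu0 = 12, 5.52, 2.542, 0.47

xbar = sum_x / n
s = math.sqrt((sum_x2 - n * xbar ** 2) / (n - 1))

# t using the rounded values quoted in the worked solution (s = 0.016)
t_rounded = (round(xbar, 2) - mu0) / (round(s, 3) / math.sqrt(n))
# t carrying full precision through the calculation
t_exact = (xbar - mu0) / (s / math.sqrt(n))

critical = 2.201                      # t_11(0.025), read from the table
significant = abs(t_exact) >= critical  # two-tailed critical region |t| >= 2.201

print(round(t_rounded, 3), round(t_exact, 3), significant)
```

Either way the statistic stays inside the interval from −2.201 to 2.201, so the conclusion does not depend on the rounding, although the margin is small.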


Exercise 7B 

1 Given that the observations 9, 11, 11, 12, 14 have been drawn from a normal distribution, test H₀: μ = 11 against H₁: μ > 11. Use a 5% significance level. 

2 A random sample of size 28 taken from a normally distributed variable gave the sample values x̄ = 17.1 and s² = 4. Test H₀: μ = 19 against H₁: μ < 19. Use a 1% level of significance. 

3 A random sample of size 13 taken from a normally distributed variable gave the sample values x̄ = 3.26 and s² = 0.64. Test H₀: μ = 3 against H₁: μ ≠ 3. Use a 5% significance level. 

(E/P) 4 A certain brand of blanched hazelnuts for use in cooking is sold in packets. The weights of the packets of hazelnuts follow a normal distribution with mean, μ. The manufacturer claims that μ = 100 g. A sample of 15 packets was taken and the weight, x, of each was measured. 
The results are summarised by the following statistics: Σx = 1473, Σx² = 148 119. 


a Explain why it is not suitable to use a normal approximation for a . in this 

instance. aa (1 mark) 
b Test, at the 5% significance level, whether or not there is evidence to justify the 

manufacturer’s claim. (7 marks) 
A manufacturer claims that the lifetimes of its 100-watt bulbs are normally distributed with a 
mean of 1000 hours. A laboratory tests 8 bulbs and finds their lifetimes to be 985, 920, 1110, 
1040, 945, 1165, 1170 and 1055 hours. 


Stating your hypotheses clearly, examine whether or not the bulbs have a longer mean 
lifetime than that claimed. Use a 5% level of significance. (7 marks) 


A fertiliser manufacturer claims that by using brand F fertiliser the yield of fruit bushes will 

be increased. A random sample of 14 fruit bushes was fertilised with brand F and the resulting 
yields, x, were summarised by Σx = 90.8, Σx² = 600. The yield of bushes fertilised by the usual 
fertiliser was normally distributed with a mean of 6 kg of fruit per bush. 

Test, at the 2.5% significance level, the manufacturer’s claim. (7 marks) 


A nuclear reprocessing company claims that the amount of radiation within a reprocessing 
building in which there had been an accident had been reduced to an acceptable level by their 
clean-up team. The amounts of radiation, x, at 20 sites within the building in suitable units are 
summarised by Σx = 21.7, Σx² = 28.4. In the same units, the acceptable level of radiation is 
given as 1.00. 


a By carrying out a suitable test for the population mean, test whether the building falls within 
acceptable radiation levels. (7 marks) 


b State one assumption made in carrying out your test. (1 mark) 


Scores in an aptitude test are assumed to be normally distributed with a population mean of 
100. A company claims to be able to train people to improve their scores in the test. A random 
sample of 20 people is taken and they are trained before taking the test. The sample standard 
deviation is found to be 15 and the mean of the scores of the 20 people is found to be 110. 


a Test, at the 5% level of significance, whether there is evidence of the training improving 

the scores of participants. State your hypotheses clearly. (5 marks) 
b Test, at the 10% level of significance, the hypothesis that the population standard 

deviation is different from 12. State your hypotheses clearly. (5 marks) 


7.3 The paired t-test 


There are many occasions when you might want to compare results before and after some treatment, 
or the effectiveness of two different types of treatment. You could, for example, be investigating the 
effect of alcohol on people’s reactions, or the difference in intelligence levels of identical twins who 
were separated at birth and who have been brought up in different family circumstances. 


In both cases you need to have a common link between the two sets of results, for instance by 
taking the same person’s result before and after drinking alcohol, or by the twins being identical. It 
is necessary to have this link so that differences caused by other factors are eliminated as much as 
possible. It would, for example, be of little use if you tested one person’s reactions without drinking 
alcohol and a different person’s reactions after drinking alcohol because any difference could be due 
to normal variations between their reactions. In the same way you would have to use identical twins 
in the intelligence experiment, otherwise any difference in intelligence might be due to the normal 
variability of intelligence between different people. In these cases, each result in one of the samples is 
paired with a result in the other sample; the results are therefore referred to as paired. 


In paired experiments such as these you are not really interested in the individual results as such, 
but in the difference, D, between the results. In these circumstances you can treat the differences 
between pairs of matched subjects as if they were a random sample from a N(μ_D, σ²) distribution. You 
can then proceed as you did for a single sample. 


Although you do not need to assume the two populations are normal, you need to assume that the differences are normally distributed. Given that you are unlikely to know σ² and that n is likely to be small, then 
$$t = \frac{\bar D - \mu_D}{S/\sqrt n} \sim t_{n-1}$$
Taking H₀: μ_D = 0 as your null hypothesis, this reduces to 
$$t = \frac{\bar D - 0}{S/\sqrt n} = \frac{\bar D}{S/\sqrt n}$$

Note: This is the null hypothesis that on average there is no difference between the two populations. 

■ In a paired experiment with a mean of the differences between the samples of $\bar D$, 
$$\frac{\bar D - \mu_D}{S/\sqrt n} \sim t_{n-1}$$


The paired t-test proceeds in almost the same way as the t-test itself. The steps are given below.
1 Write down the null hypothesis $H_0$.
2 Write down the alternative hypothesis $H_1$.
3 Specify $\alpha$.
4 Write down the degrees of freedom (remembering that $\nu = n - 1$).
5 Write down the critical region.
6 Calculate the differences $d$.
  Calculate $\bar{d}$ and $s^2$.
  Calculate the value of the test statistic $t = \dfrac{\bar{d} - \mu_D}{s/\sqrt{n}}$.
7 Complete the test and state your conclusions. As before, the following points should be addressed:


i Is the result significant or not? 
ii What are the implications in terms of the original problem?

Example

In an experiment to test the effects of alcohol on the reaction times of people, a group of 10 
students took part in the experiment. The students were asked to react to a light going on by 
pushing a switch that would switch it off again. Their reaction times were automatically recorded. 
After the students had each drunk one pint of beer the experiment was repeated. The results are 
shown below.

Student | A | B | C | D | E | F | G | H | I | J
Reaction time before (seconds) | 0.8 | 0.2 | 0.4 | 0.6 | 0.4 | 0.6 | 0.4 | 0.8 | 1.0 | 0.9
Reaction time after (seconds) | 0.7 | 0.5 | 0.6 | 0.8 | 0.8 | 0.6 | 0.7 | 0.9 | 1.0 | 0.7
Difference | -0.1 | 0.3 | 0.2 | 0.2 | 0.4 | 0 | 0.3 | 0.1 | 0 | -0.2

Test, at the 5% significance level, whether or not the consumption of a pint of beer increased the 
students' reaction times.


$H_0: \mu_D = 0$   $H_1: \mu_D > 0$   [State your hypotheses.]

Significance level = 0.05 (one-tailed test)   [Write down the significance level.]

$\nu = 10 - 1 = 9$   [Find the number of degrees of freedom.]

Critical value $t_9(5\%) = 1.833$   [Look up the critical value in the table.]

The critical region is $t \geq 1.833$   [Write down the critical region.]

$$\bar{d} = \frac{\sum d}{n} = \frac{1.2}{10} = 0.12$$   [Calculate $\bar{d}$ and $s^2$.]

$$s^2 = \frac{\sum d^2 - n\bar{d}^2}{n - 1} = \frac{0.48 - 10(0.12)^2}{9} = 0.037333$$

$$t = \frac{0.12 - 0}{\sqrt{\dfrac{0.037333}{10}}} = 1.9640$$   [Calculate the value of the test statistic $t = \dfrac{\bar{d} - \mu_D}{s/\sqrt{n}}$.]

1.9640 > 1.833. The result is significant: reject $H_0$. There is evidence that consuming 
a pint of beer increased the students' reaction times.

[Always state whether you accept or reject $H_0$ and draw a conclusion (in context if possible).]
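
The arithmetic in this example can be checked with a short script. The following is a minimal sketch, not part of the original text: the function name `paired_t_statistic` is hypothetical, only the Python standard library is used, and the critical value is typed in from the t-distribution tables rather than computed.

```python
from math import sqrt


def paired_t_statistic(before, after, mu_d=0.0):
    """Return (d_bar, s2, t) for the paired differences after - before.

    mu_d is the mean difference under the null hypothesis.
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    d_bar = sum(d) / n
    # Unbiased estimate of the variance of the differences
    s2 = (sum(x * x for x in d) - n * d_bar ** 2) / (n - 1)
    t = (d_bar - mu_d) / sqrt(s2 / n)
    return d_bar, s2, t


# Reaction times (seconds) for students A to J from the example
before = [0.8, 0.2, 0.4, 0.6, 0.4, 0.6, 0.4, 0.8, 1.0, 0.9]
after = [0.7, 0.5, 0.6, 0.8, 0.8, 0.6, 0.7, 0.9, 1.0, 0.7]

d_bar, s2, t = paired_t_statistic(before, after)
critical_value = 1.833  # t_9(5%), one-tailed, read from tables (assumed input)
print(round(d_bar, 2), round(s2, 6), round(t, 4))
print("reject H0" if t >= critical_value else "do not reject H0")
```

Passing the critical value in by hand keeps the sketch free of third-party libraries and mirrors the table lookup in step 5 of the method.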


Confidence intervals and tests using the t-distribution 


Example

In order to compare two methods of measuring the hardness of metals, readings of Brinell 
hardness were taken using each method for 8 different metal specimens. The resulting Brinell 
hardness readings are given in the table below.

Material | Reading method A | Reading method B
Aluminium | 29 | 31
Magnesium alloy | 64 | 63
Wrought iron | 104 | 105
Duralumin | 116 | 119
Mild steel | 138 | 140
70/30 brass | 156 | 156
Cast iron | 199 | 200
Nickel chrome steel | 385 | 386

Use a paired t-test, at the 5% level of significance, to test whether or not there is a difference in the 
readings given by the two methods.


$H_0: \mu_D = 0$   $H_1: \mu_D \neq 0$   [State your hypotheses.]

Probability in each tail = 0.025   [This is a two-tailed test so halve the significance level to find the probability in each tail.]

$\nu = 8 - 1 = 7$   [Find the number of degrees of freedom.]

Critical value $t_7(2.5\%) = 2.365$   [Look up the critical value in the table on page 217.]

The critical regions are $t \leq -2.365$ and $t \geq 2.365$   [Write down the critical regions.]

The differences, $d$, are 2, -1, 1, 3, 2, 0, 1 and 1   [Calculate $d$, $\bar{d}$ and $s^2$.]

$\sum d = 9$   $\sum d^2 = 21$

$$\bar{d} = \frac{9}{8} = 1.125$$

$$s^2 = \frac{\sum d^2 - n\bar{d}^2}{n - 1} = \frac{21 - 8(1.125)^2}{7} = 1.554$$

$$t = \frac{1.125 - 0}{\sqrt{\dfrac{1.554}{8}}} = 2.553$$   [Calculate the value of the test statistic $t = \dfrac{\bar{d} - \mu_D}{s/\sqrt{n}}$.]

The t value is significant; there is sufficient evidence to reject the null hypothesis. There 
is a difference between the mean hardness readings using the two methods.

[Always state whether you accept or reject $H_0$ and draw a conclusion (in context if possible).]

Exercise 7C


(E/P) 1 It is claimed that completion of a shorthand course has increased the shorthand speeds of the 
students.
a If the suggestion that the mean speed of the students has not altered is to be tested, 
write down suitable hypotheses for which
  i a two-tailed test would be appropriate
  ii a one-tailed test would be appropriate. (2 marks)


The table below gives the shorthand speeds of students before and after the course. 


Student | A | B | C | D | E | F
Speed before in words/minute | 35 | 40 | 28 | 45 | 30 | 32
Speed after in words/minute | 42 | 45 | 28 | 45 | 40 | 40


b Carry out a paired t-test, at the 5% significance level, to determine whether or not 
there has been an increase in shorthand speeds. (7 marks)


2 A large number of students took two General Studies papers that were supposed to be of equal 
difficulty. The results for 10 students chosen at random are shown below.

Candidate | A | B | C | D | E | F | G | H | I | J
Paper 1 | 18 | 25 | 40 | 10 | 38 | 20 | 25 | 35 | 18 | 43
Paper 2 | 20 | 27 | 39 | 12 | 40 | 23 | 20 | 35 | 20 | 41


The teacher looked at the marks of the random sample of 10 students, and decided that Paper 2 
was easier than Paper 1. 


Given that the marks on each paper are normally distributed, carry out an appropriate 
test of the teacher’s claim, at the 1% level of significance. (7 marks) 


3 It is claimed by the manufacturer that by chewing a special flavoured chewing gum, smokers are 
able to reduce their craving for cigarettes, and thus cut down on the number of cigarettes smoked 
per day. In a trial of the gum on a random selection of 10 people, the no-gum smoking rate and 
the smoking rate when chewing the gum were investigated, with the following results.

Person | A | B | C | D | E | F | G | H | I | J
Without-gum smoking rate (cigs./day) | 20 | 35 | 40 | 32 | 45 | 15 | 22 | 30 | 34 | 40
With-gum smoking rate (cigs./day) | 15 | 25 | 35 | 30 | 45 | 15 | 14 | 25 | 28 | 34
a Use a paired t-test at the 5% significance level to test the manufacturer’s claim. (7 marks) 
b State any assumptions you have had to make. (1 mark) 


4 A town council is going to put a new traffic management scheme into operation in the hope that 
it will make travel to work in the mornings quicker for most people. Before the scheme is put 
into operation, 10 randomly selected workers are asked to record the time it takes them to come 
into work on a Wednesday morning. After the scheme is put into place, the same 10 workers 
are again asked to record the time it takes them to come into work on a particular Wednesday 
morning.
The times in minutes are shown in the table below.

Worker | A | B | C | D | E | F | G | H | I | J
Before | 23 | 37 | 53 | 42 | 39 | 60 | 54 | 85 | 46 | 38
After | 18 | 35 | 49 | 42 | 34 | 48 | 52 | 79 | 37 | 37


Test, at the 5% significance level, whether or not the journey time to work has decreased. 
(7 marks) 


5 A teacher wants to test the idea that students' results in mock examinations are good predictors 
for their results in actual examinations. He selects 8 students at random from those doing a mock 
Statistics examination and records their marks out of 100. Later he collects the same students' 
marks in the actual examination. The resulting marks are as follows:

Student | A | B | C | D | E | F | G | H
Mock examination mark | 35 | 86 | 70 | 91 | 45 | 64 | 78 | 38
Actual examination mark | 45 | 77 | 81 | 86 | 53 | 71 | 68 | 46
a Use a paired t-test to investigate whether or not the mock examination is a good 
predictor. (Use a 10% significance level.) (7 marks) 
b State any assumptions you have made. (1 mark) 


6 The manager of a dress-making company took a random sample of 10 of his employees and 
recorded the number of dresses made by each. He discovered that the number of dresses made 
between 3.00 and 5.00 p.m. was fewer than the same employees achieved between 9.00 and 
11.00 a.m. He wondered whether a tea break from 2.45 to 3.00 p.m. would increase productivity 
during these last two hours of the day.

The numbers of dresses made by these workers in the last two hours of the day before and after 
the introduction of the tea break were as shown below.

Worker A B C D E F G H I J
Before 75 73 75 81 74 13 a 75 75 72 
After 80 84 719 84 85 84 78 78 80 83 


a Why was the comparison made for the same 10 workers? (1 mark) 


b Conduct, at the 5% level of significance, a paired t-test to see whether the introduction 
of a tea break has increased productivity between 3.00 and 5.00 p.m. (7 marks) 


7 A drug administered in tablet form to help people sleep and a placebo were given for two weeks 
to a random sample of eight patients in a clinic. The drug and the placebo were given in random 
order for one week each. The average numbers of hours sleep that each patient had per night 
with the drug and with the placebo are given in the table below. 

Patient 1 2 3 4 5 6 7 8 
Hours of sleep with drug 10.5 | 6.7 8.9 6.7 9.2 | 10.9 | 11.9 | 7.6 
Hours of sleep with placebo 10.3 | 6.5 9.0 5.3 8.7 7.5 9.3 7.2 


Test, at the 1% level of significance, whether or not the drug increases the mean number of hours 
sleep per night. State your hypotheses clearly. (7 marks)

7.4 Difference between means of two independent normal distributions

You need to be able to find a confidence interval for the difference between two means from 
independent normal distributions with equal but unknown variances.

In Section 5.3, you carried out hypothesis tests for the difference between the means of normal 
distributions with known variances. In that case the population distributions could have different 
variances. The techniques of this section and the next section only apply to normal distributions 
with unknown but equal variances.

To do this, you need to find a pooled estimate of variance.

Suppose that you take random samples from random variables X and Y that have a common 
variance, $\sigma^2$. You will have two estimates of $\sigma^2$, namely $s_x^2$ and $s_y^2$. A better estimate of 
$\sigma^2$ than either $s_x^2$ or $s_y^2$ can be obtained by pooling the two estimates. You will recall that, for 
a single sample, an unbiased estimate of the population variance was given by

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

A similar idea works for two pooled estimates. You have

$$s_x^2 = \frac{\sum (x - \bar{x})^2}{n_x - 1}, \qquad s_y^2 = \frac{\sum (y - \bar{y})^2}{n_y - 1}$$

so that $(n_x - 1)s_x^2 = \sum (x - \bar{x})^2$ and $(n_y - 1)s_y^2 = \sum (y - \bar{y})^2$.

These are the sums of the squares of the differences of each sample value from the sample mean. 
You can add them together to get a total sum of squares of differences:

$$\sum (x - \bar{x})^2 + \sum (y - \bar{y})^2 = (n_x - 1)s_x^2 + (n_y - 1)s_y^2$$

You can use this sum to calculate a pooled estimate, $s_p^2$, of $\sigma^2$:

$$s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{(n_x - 1) + (n_y - 1)} = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}$$

■ If a random sample of $n_x$ observations is taken from a normal distribution with unknown 
variance $\sigma^2$, and an independent sample of $n_y$ observations is taken from a normal 
distribution that also has unknown variance $\sigma^2$, then a pooled estimate for $\sigma^2$ is

$$s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}$$

where $s_x^2 = \dfrac{\sum x^2 - n_x\bar{x}^2}{n_x - 1}$ and $s_y^2 = \dfrac{\sum y^2 - n_y\bar{y}^2}{n_y - 1}$

Notice that if $n_x = n_y = n$, this reduces to

$$s_p^2 = \frac{(n - 1)s_x^2 + (n - 1)s_y^2}{2(n - 1)} = \frac{s_x^2 + s_y^2}{2}$$

which is the mean of the two variances. The pooled estimate of variance is really a 
weighted mean of two variances with the two weights being $(n_x - 1)$ and $(n_y - 1)$.

Example


A random sample of 15 observations is taken from a population and gives an unbiased estimate 
for the population variance of 9.47. A second random sample of 12 observations is taken from 
a different population that has the same population variance as the first population, and gives 
an unbiased estimate for the variance as 13.84. Calculate an unbiased estimate of the population 
variance $\sigma^2$ using both samples.

$$s_p^2 = \frac{(14 \times 9.47) + (11 \times 13.84)}{14 + 11} = 11.3928$$

[Use $s_p^2 = \dfrac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{(n_x - 1) + (n_y - 1)}$.]
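
The pooled estimate is a weighted mean of the two sample variances, which is straightforward to express in code. The following is a minimal sketch; the function name `pooled_variance` and its argument names are hypothetical, chosen only for this illustration.

```python
def pooled_variance(n_x, s2_x, n_y, s2_y):
    """Pooled estimate of a common variance from two unbiased sample variances.

    The weights are the degrees of freedom (n_x - 1) and (n_y - 1).
    """
    if n_x < 2 or n_y < 2:
        raise ValueError("each sample needs at least two observations")
    return ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2)


# Values from the example: 15 observations with s^2 = 9.47,
# 12 observations with s^2 = 13.84
s2_p = pooled_variance(15, 9.47, 12, 13.84)
print(round(s2_p, 4))

# With equal sample sizes the pooled estimate is the plain mean
# of the two variances
equal_sizes = pooled_variance(10, 4.0, 10, 6.0)
print(equal_sizes)
```

The second call illustrates the special case noted earlier: when both samples have the same size, the weights are equal and the result is simply the average of the two variances.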


In Section 5.4, you saw that if the sample sizes are large then

$$\frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\dfrac{S_x^2}{n_x} + \dfrac{S_y^2}{n_y}}}$$

is approximately normal with distribution $N(0, 1^2)$.

When the sample sizes are small you need to make three assumptions:

1 that the populations are normal
2 that the samples are independent
3 that the variances of the two populations are equal.

Note: In many cases, it is reasonable to assume that the variances of the populations are equal. If 
you are unsure, you can use the F-distribution to test for equal variance. ← Section 6.4

The third assumption enables you to pool the two sample variances to find an estimator for the 
common variance:

$$S_p^2 = \frac{(n_x - 1)S_x^2 + (n_y - 1)S_y^2}{(n_x - 1) + (n_y - 1)}$$

Substituting $S_p^2$ for $S_x^2$ and $S_y^2$ gives

$$\frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\dfrac{S_p^2}{n_x} + \dfrac{S_p^2}{n_y}}} = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{S_p\sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}}$$

Now, because the sample sizes are small, this will not as before follow a $N(0, 1^2)$ distribution. 
You have already seen that in the single-sample case

$$\frac{\bar{X} - \mu}{S/\sqrt{n}}$$

follows a t-distribution, so you will not be surprised to find that

$$\frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{S_p\sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}}$$

also follows a t-distribution.
There are $(n_x + n_y)$ observations in the total sample and two calculated restrictions (namely the 
means $\bar{X}$ and $\bar{Y}$), so the number of degrees of freedom will be $n_x + n_y - 2$.


■ If a random sample of $n_x$ observations is taken from a normal distribution that has 
unknown variance $\sigma^2$, and an independent sample of $n_y$ observations is taken from a normal 
distribution with equal variance, then

$$\frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{S_p\sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}} \sim t_{n_x + n_y - 2} \quad \text{where } S_p^2 = \frac{(n_x - 1)S_x^2 + (n_y - 1)S_y^2}{n_x + n_y - 2}$$

You can now use tables of values for the t-distribution to find a confidence interval for $\mu_x - \mu_y$. 
For example, for a 95% confidence interval you would start by finding the value $t_c$ that is exceeded 
with probability 0.025. This would give you:

$$P(-t_c < t_{n_x + n_y - 2} < t_c) = 0.95$$

$$P\left(-t_c < \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{S_p\sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}} < t_c\right) = 0.95$$

$$P\left(-t_c S_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}} < (\bar{X} - \bar{Y}) - (\mu_x - \mu_y) < t_c S_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}\right) = 0.95$$

The confidence limits for $(\mu_x - \mu_y)$ are therefore given by

$$(\bar{x} - \bar{y}) \pm t_c s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}$$

and the confidence interval is

$$\left((\bar{x} - \bar{y}) - t_c s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}},\ (\bar{x} - \bar{y}) + t_c s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}\right)$$

■ The confidence limits for the difference between two means from independent normal 
distributions, X and Y, when the variances are equal but unknown are given by

$$(\bar{x} - \bar{y}) \pm t_c s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}$$

where $s_p^2$ is the pooled estimate of the population variance, and $t_c$ is the relevant value taken 
from the t-distribution tables.

■ The confidence interval is given by


$$\left((\bar{x} - \bar{y}) - t_c s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}},\ (\bar{x} - \bar{y}) + t_c s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}\right)$$

Example

In a survey on the petrol consumption of cars, a random sample of 12 cars with 2-litre engines was 
compared with a random sample of 15 cars with 1.6-litre engines. The following results show the 
consumption, in suitable units, of the cars:

2-litre cars: 34.4, 32.1, 30.1, 32.8, 31.5, 35.8, 28.2, 26.6, 28.8, 28.5, 33.6, 28.8

1.6-litre cars: 35.3, 34.0, 36.7, 40.9, 34.4, 39.8, 33.6, 36.7, 34.0, 39.2, 39.8, 38.7, 40.8, 35.0, 36.7

Calculate a 95% confidence interval for the difference between the two mean petrol consumption 
figures. You may assume that the variables are normally distributed and that they have the same 
variance.

For the 2-litre engine, $n_y = 12$, $\bar{y} = 30.933$, $s_y^2 = 8.177$
For the 1.6-litre engine, $n_x = 15$, $\bar{x} = 37.04$, $s_x^2 = 6.894$

$$s_p^2 = \frac{(14 \times 6.894) + (11 \times 8.177)}{25} = 7.459$$   [Use $s_p^2 = \dfrac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}$.]

$s_p = \sqrt{7.459} = 2.731$

$t_c = t_{25}(2.5\%) = 2.060$   [$\nu = 12 + 15 - 2 = 25$]

The confidence limits are

$$(37.04 - 30.933) \pm 2.060 \times 2.731\sqrt{\frac{1}{15} + \frac{1}{12}} = 6.107 \pm 2.179 = 8.286 \text{ and } 3.928$$

The 95% confidence interval is (3.928, 8.286) 
or (3.93, 8.29) to 3 s.f.
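
The interval in this example can be reproduced from the raw data. The following is a minimal sketch under stated assumptions: the function name `pooled_confidence_interval` is hypothetical, the critical value $t_c$ is supplied by hand from the t-distribution tables, and `statistics.variance` is used because it returns the unbiased (divide by $n - 1$) estimate.

```python
from math import sqrt
from statistics import mean, variance


def pooled_confidence_interval(x, y, t_c):
    """Confidence interval for mu_x - mu_y, assuming independent normal
    populations with equal but unknown variance.

    t_c is the table value of t with n_x + n_y - 2 degrees of freedom.
    """
    n_x, n_y = len(x), len(y)
    # Pooled estimate of the common variance
    s2_p = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    half_width = t_c * sqrt(s2_p) * sqrt(1 / n_x + 1 / n_y)
    diff = mean(x) - mean(y)
    return diff - half_width, diff + half_width


# Petrol consumption data from the example (x: 1.6-litre, y: 2-litre)
x = [35.3, 34.0, 36.7, 40.9, 34.4, 39.8, 33.6, 36.7, 34.0, 39.2,
     39.8, 38.7, 40.8, 35.0, 36.7]
y = [34.4, 32.1, 30.1, 32.8, 31.5, 35.8, 28.2, 26.6, 28.8, 28.5,
     33.6, 28.8]

lower, upper = pooled_confidence_interval(x, y, t_c=2.060)  # t_25(2.5%)
print(round(lower, 2), round(upper, 2))
```

Because the script carries full precision throughout, its limits can differ from a hand calculation in the third decimal place, where intermediate values have been rounded.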


Exercise 7D

(E) 1 A random sample of 10 toothed winkles was taken from a sheltered shore, and a sample of 
15 was taken from a non-sheltered shore. The maximum basal width, x mm, of the shells was 
measured and the results are summarised below.

Sheltered shore: $\bar{x} = 25$, $s^2 = 4$

Non-sheltered shore: $\bar{x} = 22$, $s^2 = 5.3$

Hint: Use $(\bar{x} - \bar{y}) \pm t_c s_p\sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}$
a Find a 95% confidence interval for the difference between the means. (6 marks) 
b State an assumption that you have made when calculating this interval. (1 mark) 


(E/P) 2 A packet of plant seeds was sown and, when the seeds had germinated and begun to grow, 
8 were transferred into pots containing a soil-less compost and 10 were grown on in a soil-based 
compost. After 6 weeks of growth, the heights, x, in cm of the plants were measured with the 
following results. 
Soil-less compost: 9.3, 8.7, 7.8, 10.0, 9.2, 9.5, 7.9, 8.9 
Soil-based compost: 12.8, 13.1, 11.2, 10.1, 13.1, 12.0, 12.5, 11.7, 11.9, 12.0 


a Assuming that the populations are normally distributed, and that there is a difference between 
the two means, calculate a 90% confidence interval for this difference. (6 marks) 


b State an additional assumption you have used when calculating this interval and discuss 
whether this assumption is reasonable in the context given. (2 marks)

(E/P) 3 Forty children were randomly selected from all 12-year-old children in a large city to compare 
two methods of teaching the spelling of 50 words which were likely to be unfamiliar to the 
children. Twenty children were randomly allocated to each method. Six weeks later the children 
were tested to see how many of the words they could spell correctly. The summary statistics 
for the two methods are given in the table below, where $\bar{x}$ is the mean number of words spelled 
correctly, $s^2$ is an unbiased estimate of the variance of the number of words spelled correctly and 
$n$ is the number of children taught using each method.

 | $\bar{x}$ | $s^2$ | $n$
Method A | 32.7 | 6.12 | 20
Method B | 38.2 | 5.22 | 20

a Calculate a 99% confidence interval for the difference between the mean numbers 
of words spelled correctly by children who used Method B and Method A. (6 marks)
b State two assumptions you have made in carrying out part a. (2 marks)
c Interpret your result. (1 mark)


(E) 4 The table below shows summary statistics for the mean daily consumption of cigarettes by a 
random sample of 10 smokers before and after their attendance at an anti-smoking workshop 
with $\bar{x}$ representing the mean and $s^2$ representing the unbiased estimate of population variance 
in each case.

 | $\bar{x}$ | $s^2$ | $n$
Mean daily consumption before the workshop | 18.6 | 32.488 | 10
Mean daily consumption after the workshop | 14.3 | 33.344 | 10

Stating clearly any assumption you make, calculate a 90% confidence interval for the 
difference in the mean daily consumption of cigarettes before and after the workshop. (7 marks)


(E/P) 5 Two farmers add different protein supplements to the feed of cows to increase the yield of milk. 
A sample of 8 cows is taken from the first farmer, who uses supplement A, and the yield of milk 
is measured. A second sample, of size 7, is taken from the cows of the second farmer, who uses 
supplement B. The table shows the mean daily yield, in litres, and unbiased estimates for the 
population variance in each case. 


 | $\bar{x}$ | $s^2$ | $n$
Supplement A | 24.5 | 1.2 | 8
Supplement B | 26.8 | 1.6 | 7


a Stating your hypotheses clearly, test, at the 10% level of significance, the hypothesis that 
there is a difference in the variability of the yields. State any assumptions you make. (5 marks) 


The farmers wish to find a confidence interval for the difference in the average milk yield for the 
two supplements. 


b Explain how the result from part a can be used to justify the use of a t-distribution to find the 
confidence interval. (1 mark)


c Find, correct to 3 significant figures, a 95% confidence interval for the difference in 
the average milk yield. (5 marks)

7.5 Hypothesis test for the difference between means

Apart from using the t-distribution rather than the normal distribution for finding the critical values, 
testing the difference between means of two independent normal distributions with unknown 
variances follows similar steps to those used for testing the difference of means when the variances 
are known.

The following steps might help you in answering questions on the difference of means of normal 
distributions when the variances are unknown.

1 Write down $H_0$.
2 Write down $H_1$.
3 Specify the significance level, $\alpha$.
4 Write down the number of degrees of freedom, $\nu$.
5 Write down the critical region.
6 Calculate the sample means and variances, $\bar{x}$, $\bar{y}$, $s_x^2$ and $s_y^2$.
7 Calculate a pooled estimate of the variance:

$$s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}$$

8 Calculate the value of t:

$$t = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{s_p\sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}}$$

9 Complete the test and state your conclusions. The following points should be addressed:
i Is the result significant?
ii What are the implications in terms of the original problem?
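
Steps 4 and 6 to 8 above are purely computational, so they can be sketched in a few lines of code. This is an illustrative sketch, not a prescribed method: the function name `two_sample_t` and the parameter `hypothesised_difference` (the value of $\mu_x - \mu_y$ under $H_0$) are hypothetical, and the critical value still has to be read from tables. The data are the test results for groups X and Y used in the example that follows.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x, y, hypothesised_difference=0.0):
    """Return (degrees of freedom, pooled variance, t) for two independent
    samples, assuming normal populations with equal variance."""
    n_x, n_y = len(x), len(y)
    dof = n_x + n_y - 2                                   # step 4
    s2_x, s2_y = variance(x), variance(y)                 # step 6 (unbiased)
    s2_p = ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / dof    # step 7
    t = ((mean(x) - mean(y) - hypothesised_difference)
         / (sqrt(s2_p) * sqrt(1 / n_x + 1 / n_y)))        # step 8
    return dof, s2_p, t


group_x = [40, 37, 45, 34, 30, 41, 42, 43, 36]
group_y = [38, 43, 36, 45, 35, 44, 41]

dof, s2_p, t = two_sample_t(group_x, group_y)
critical_value = 1.761  # t_14(5%) from tables, two-tailed test at 10%
significant = abs(t) >= critical_value
print(dof, round(s2_p, 3), round(t, 3), significant)
```

For a one-tailed test of whether one mean exceeds the other by more than some amount, the same function can be called with `hypothesised_difference` set to that amount and the sign of `t` compared with a one-tailed critical value.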


Two groups of students, X and Y, were taught by different teachers. At the end of their course, 
a random sample of students from each class was selected and given a test. The test results out of 
50 were as follows: 

Group X: 40 37 45 34 30 41 42 43 36 

Group Y: 38 43 36 45 35 44 41 


The headteacher wishes to find out if there is a significant difference between the results for these 

two groups. 

a Write down any assumptions that need to be made in order to conduct a difference of means test 
on this data. 

b Assuming that these assumptions apply, test at the 10% level of significance whether or not there 
is a significant difference between the means.

a The assumptions that need to be made are that the two samples come from normal 
distributions, are independent and that the populations from which they are taken have the 
same variances.

b $H_0: \mu_x = \mu_y$   $H_1: \mu_x \neq \mu_y$   [State your hypotheses.]

Probability in each tail = 0.05   [This is a two-tailed test. Halve the significance level to find the probability in each tail.]

$\nu = 9 + 7 - 2 = 14$   [Find the number of degrees of freedom ($n_x + n_y - 2$ in this case).]

Critical value $t_{14}(5\%)$ is 1.761   [Look up the critical value in the table.]

[Figure: t-distribution curve with an area of 0.05 in each tail, beyond -1.761 and 1.761]

The critical regions are $t \leq -1.761$ and $t \geq 1.761$   [Write down the critical region. There are two regions as it is a two-tailed test.]

Using a calculator gives   [Calculate $\bar{x}$, $\bar{y}$, $s_x^2$ and $s_y^2$.]

$n_x = 9$, $\bar{x} = 38.667$, $s_x^2 = 23.0$
$n_y = 7$, $\bar{y} = 40.286$, $s_y^2 = 15.9$

$$s_p^2 = \frac{(8 \times 23) + (6 \times 15.9)}{9 + 7 - 2} = 19.957$$   [Calculate a pooled estimate of the variance using $\dfrac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}$.]

So $s_p = 4.467$

$$t = \frac{38.667 - 40.286}{4.467\sqrt{\dfrac{1}{9} + \dfrac{1}{7}}} = -0.719$$   [Calculate t using $\dfrac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{s_p\sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}}$; $\mu_x - \mu_y = 0$ from hypothesis.]

-1.761 < -0.719 < 1.761 so the result is not significant. Accept $H_0$. On the evidence given 
by the two samples there is no difference between the means of the two groups.

[Always state whether you accept or reject $H_0$ and draw a conclusion (in context if possible).]


Example

A random sample of the heights, in cm, of sixth form boys and girls was taken with the following 
results:
Boys' heights: 152, 148, 147, 157, 158, 140, 141, 144
Girls' heights: 142, 146, 132, 125, 138, 131, 143
a Carry out a two-sample t-test at the 5% significance level on these data to see whether the mean 
height of boys exceeds the mean height of girls by more than 4 cm.
b State any assumptions that you have made.

Let x be the height of a boy and y be the height of a girl.

a $H_0: \mu_x = \mu_y + 4$   $H_1: \mu_x > \mu_y + 4$   [State your hypotheses and write down the significance level.]

Significance level = 0.05 (one-tailed test)

$\nu = 8 + 7 - 2 = 13$   [Find the number of degrees of freedom.]

Critical value $t_{13}(5\%) = 1.771$   [Look up the critical value in the table and write down the critical region.]

The critical region is $t \geq 1.771$

Using a calculator gives:   [Calculate $\bar{x}$, $\bar{y}$, $s_x^2$ and $s_y^2$.]

For the boys: $\bar{x} = 148.375$, $s_x^2 = 46.554$, $n_x = 8$
For the girls: $\bar{y} = 136.714$, $s_y^2 = 57.905$, $n_y = 7$

$$s_p^2 = \frac{(7 \times 46.554) + (6 \times 57.905)}{8 + 7 - 2} = 51.793$$   [Calculate a pooled estimate of the variance using $\dfrac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}$.]

So $s_p = 7.197$

$$t = \frac{(148.375 - 136.714) - 4}{7.197\sqrt{\dfrac{1}{8} + \dfrac{1}{7}}} = 2.057$$   [Calculate t using $\dfrac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{s_p\sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}}$; $\mu_x - \mu_y = 4$ from hypothesis.]

2.057 is in the critical region and so there is sufficient evidence to reject the null 
hypothesis. The mean height of boys exceeds the mean height of girls by more than 4 cm.

[Always state whether you accept or reject $H_0$ and draw a conclusion (in context if possible).]

b The assumptions made are that the two samples are independent, that the variances 
of both populations are equal and that the populations are normally distributed.


Exercise 7E

1 A random sample of size 20 from a normal population gave $\bar{x} = 16$, $s^2 = 12$. 
A second random sample of size 11 from a normal population gave $\bar{x} = 14$, $s^2 = 12$. 
a Assuming that both populations have the same variance, write down an unbiased estimate for 
that variance. (1 mark)
b Test, at the 5% level of significance, the suggestion that the two populations have the same 
mean. (5 marks)


2 Salmon reared in Scottish fish farms are generally larger than wild salmon. A fisherman measured 
the length of the first 6 wild salmon caught from a river. Their lengths in centimetres were:

42.8 40.0 38.2 37.5 37.0 36.5

Chefs prefer wild salmon to fish-farmed salmon because of their better flavour. A chef was 
offered 4 salmon that were claimed to be wild. Their lengths in centimetres were:

42.0 43.0 41.5 40.0

Use the information given above and a suitable t-test at the 5% level of significance to help 
the chef to decide if the claim is likely to be correct. You may assume that the populations are 
normally distributed. (8 marks)

3 In order to check the effectiveness of three drugs against the E. coli bacillus, 15 cultures of the 
bacillus (five for each of three different antibiotics) had discs soaked in the antibiotics placed in 
their centre. The 15 cultures were left for a time and the area in cm² per microgram of drug where 
the E. coli was killed was measured. The results for the three different drugs are given below.

Streptomycin: 0.210 0.252 0.251 0.210 0.256 0.253
Tetracycline: 0.123 0.090 0.123 0.141 0.142 0.092
Erythromycin: 0.134 0.120 0.123 0.210 0.134 0.134

a It was thought that tetracycline and erythromycin seemed equally effective. Assuming that the 
populations are normally distributed, test this at the 5% significance level. (8 marks)

b Streptomycin was thought to be more effective than either of the other two drugs. Treating the 
other two as being a single sample of 12, test this assertion at the same level of significance. 
(7 marks)

4 To test whether a new version of a computer programming language enabled faster task 
completion, the same task was performed by 16 programmers, divided at random into two 
groups. The first group used the new version of the language, and the time for task completion, 
in hours, for each programmer was as follows:

4.9 6.3 9.6 5.2 4.1 7.2 4.0

The second group used the old version, and their times were summarised as follows:

$n = 9$, $\sum x = 71.2$, $\sum x^2 = 604.92$

a State the null and alternative hypotheses. (1 mark)
b Perform an appropriate test at the 5% level of significance. (7 marks)
In order to compare like with like, experiments such as this are often performed using the same 
individuals in the first and the second groups.
c Give a reason why this strategy would not be appropriate in this case. (1 mark)


5 A company undertakes investigations to compare the fuel consumption, x, in miles per gallon, 
of two different cars, the Volcera and the Spintono, with a view to purchasing a number as 
company cars.
For a random sample of 12 Volceras the fuel consumption is summarised by

$\sum x = 384$   $\sum x^2 = 12480$

A statistician incorrectly combines the figures for the sample of 12 Volceras with those of a 
random sample of 15 Spintonos, then carries out calculations as if they are all one larger sample 
and obtains the results $\bar{x} = 34$ and $s^2 = 23$.

a Show that, for the sample of 15 Spintonos, $\sum x = 534$ and $\sum x^2 = 19330$. (2 marks)
b Given that the variance of the fuel consumption for each make of car is $\sigma^2$, obtain an 
unbiased estimate for $\sigma^2$. (3 marks)
c Test, at the 5% level of significance, whether there is a difference between the mean fuel 
consumption of the two models of car. State your hypotheses and conclusion clearly. (7 marks)
d State any further assumption you made in order to be able to carry out your test in part c. 
(1 mark)
e Give two precautions which could be taken when undertaking an investigation into the fuel 
consumption of two models of car to ensure that a fair comparison is made. (2 marks)


6 A group of scientists is experimenting with different fertilisers. Fertiliser A is given to a crop of 
potatoes and a sample of 11 plants is taken. The weight of potatoes, in kg, from each plant is 
measured. Fertiliser B is given to a second crop of potatoes and a sample of 13 plants is taken. 
The weight of potatoes, in kg, from each plant is also measured. The table shows the mean 
weight of potatoes and unbiased estimates for the population variance in each case.

 | $\bar{x}$ | $s^2$ | $n$
Fertiliser A | 42.1 | 2.1 | 11
Fertiliser B | 46.3 | 3.3 | 13

a Stating your hypotheses clearly, test, at the 10% level of significance, the hypothesis that there 
is a difference in the variability of the weights. State any assumptions you make. (5 marks)

The scientists wish to test if there is a difference in the average weight of potatoes for each fertiliser.
b Explain how the result in part a can be used to justify the scientists using a two-sample 
t-test to test their hypothesis. (1 mark)
c Stating your hypotheses clearly, test, at the 5% level of significance, whether there is a 
difference in the average weight of potatoes. (5 marks)


Challenge 


For samples of sizes $n_x$ and $n_y$ from populations X and Y with equal but 
unknown variance $\sigma^2$, show that the pooled sample variance

$$S_p^2 = \frac{(n_x - 1)S_x^2 + (n_y - 1)S_y^2}{n_x + n_y - 2}$$

is an unbiased estimator of $\sigma^2$. You may assume that the sample 
variances $S_x^2$ and $S_y^2$ are each unbiased estimators for $\sigma^2$.


Mixed exercise 


1 A random sample of 14 observations is taken from a normal distribution. The sample has a 
mean $\bar{x} = 30.4$ and a sample variance $s^2 = 36$.

It is suggested that the population mean is 28. Test this hypothesis at the 5% level of significance.


2 A random sample of 8 observations is taken from a random variable X that is normally 
distributed. The sample gave the following summary statistics: 


yr aos Sox = 8s 
The population mean is thought to be 10. Test this hypothesis against the alternative hypothesis 
that the mean is greater than 10. Use the 5% level of significance. 


3 Six eggs selected at random from the daily output of a brood of hens had the following weights 
in grams:

55 50 33 33 82 34

Calculate 95% confidence intervals for:
a the mean (5 marks)
b the variance of the population from which these eggs were taken. (5 marks)
c What assumption have you made about the distribution of the weights of eggs? (1 mark)

Hint: For part b, use the chi-squared distribution. ← Section 6.1

4 A sample of size 18 was taken from a random variable X which was normally distributed, 
producing the following summary statistics:

$\bar{x} = 9.8$   $s^2 = 0.49$

Calculate 95% confidence intervals for:
a the mean
b the variance of the population.


5 A manufacturer claims that the lifetime of its batteries is normally distributed with mean 21.5 
hours. A laboratory tests 8 batteries and finds the lifetimes of these batteries to be as follows:

19.7 18.4 22.2 20.8 16.9 23.3 23.2 21.1

Stating clearly your hypotheses, examine whether or not these lifetimes indicate that the batteries 
have a shorter mean lifetime than that claimed by the manufacturer. 
Use a 5% level of significance. (6 marks)


A diabetic patient monitors his blood glucose in mmol/l at random times of the day over several 
days. The following is a random sample of the results for this patient. 


5.1 3.8 6.1 6.8 6.2 3.1 6.3 6.6 6.1 7.9 5.8 6.5
Assuming the data to be normally distributed, calculate a 95% confidence interval for: 
a the mean of the population of blood glucose readings (6 marks) 
b the standard deviation of the population of blood glucose readings. (6 marks) 


The level of blood glucose varies throughout the day according to the consumption of food and 
the amount of exercise taken during the day. 


c Comment on the suitability of the patient’s method of data collection. (1 mark) 


In order to discover the possible error in using a stopwatch, a student started the watch and 
stopped it again as quickly as she could. The times taken in centiseconds for 6 such attempts are 
recorded below: 


10 13 14 10 13 9 
Assuming that the times are normally distributed, find 95% confidence limits for: 
a the mean (6 marks) 
b the variance. (6 marks) 


A manufacturer claims that the car batteries which it produces have a mean lifetime of 24 
months, with a standard deviation of 4 months. A garage selling the batteries doubts this claim 
and suggests that both values are in fact higher. 


The garage monitors the lifetimes of 10 randomly selected batteries and finds that they have a 
mean lifetime of 27.2 months and a standard deviation of 5.2 months. 


Stating clearly your hypotheses and using a 5% level of significance, test the claim made by the 
manufacturer for: 


a the standard deviation (6 marks) 
b the mean. (6 marks) 
c State an assumption which has to be made when carrying out these tests. (1 mark) 

The distance to takeoff from a standing start of an aircraft was measured on twenty occasions.
The results are summarised in the table.

Distance (m) | Frequency
700-         | 3
[illegible]  | [illegible]
[illegible]  | 9
730-         | [illegible]
[illegible]  | 1

Assuming that distance to takeoff is normally distributed, find 95% confidence intervals for:
a the mean (6 marks)
b the standard deviation. (6 marks)

It has been hypothesised that the mean distance to takeoff is 725 m.

c Comment on this hypothesis in the light of your interval from part a. (1 mark)


A company knows from previous experience that the mean time taken by maintenance 
engineers to repair a particular electrical fault on a complex piece of electrical equipment is 
3.5 hours, with a standard deviation of 0.5 hours. 


A new method of repair has been devised, but before converting to this new method the 
company took a random sample of 10 of its engineers and each engineer carried out a repair 
using the new method. The time, x hours, it took each of them to carry out the repair was 
recorded and the data are summarised below: 


$\sum x = 34.2$    $\sum x^2 = 121.6$
Assume that the data can be regarded as a random sample from a normal population. 
a For the new repair method, calculate an unbiased estimate of the variance. (2 marks) 
b Use your estimate from part a to calculate, for the new repair method, a 95% confidence 
interval for: 
i the mean 
ii the standard deviation. (10 marks) 
c Use your calculations and the given data to compare the two repair methods in order to 
advise the company as to which method to use. (2 marks) 


d Suggest an alternative way of comparing the two methods of repair using the 10 randomly 
chosen engineers. (1 mark) 


A random sample of 60 female raccoons is taken and their heights recorded. The sample mean is
found to be 24 cm and an unbiased estimate for the population variance is found to be 2.1 cm².

a Given that the underlying population is normally distributed, find a 90% confidence interval
for the mean height of female raccoons. State clearly the approximating distribution you
have used to determine this confidence interval. (5 marks)

A second random sample of 6 male raccoons is taken and their heights recorded. The sample
mean is found to be 27 cm and an unbiased estimate for the population variance is found to
be 2.7 cm².

A hypothesis test is to be carried out to test if the mean height of male raccoons is greater
than 25 cm.

b Explain why the approximating distribution used in part a is no longer valid when carrying
out this test. (1 mark)
c Test, at the 5% level of significance, the hypothesis that male raccoons have an average height
greater than 25 cm. State your hypotheses clearly. (5 marks)

A chemist has developed a fuel additive and claims that it reduces the fuel consumption of cars.
To test this claim, 8 randomly selected cars were each filled with 20 litres of fuel and driven 
around a race circuit. Each car was tested twice, once with the additive and once without. 

The distances, in miles, that each car travelled before running out of fuel are given in the table 
below. 


Car                       |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8
Distance without additive | 163 | 172 | 195 | 170 | 183 | 185 | 161 | 176
Distance with additive    | 168 | 185 | 187 | 172 | 180 | 189 | 172 | 175


Assuming that the distances travelled follow a normal distribution and stating your hypotheses 
clearly, test, at the 10% level of significance, whether or not there is evidence to support the 
chemist’s claim. (7 marks) 


A farmer set up a trial to assess the effect of two different diets on the increase in the weight of 
his lambs. He randomly selected 20 lambs. Ten of the lambs were given diet A and the other 

10 lambs were given diet B. The gain in weight, in kg, of each lamb over the period of the trial 
was recorded. 


a State why a paired t-test is not suitable for use with these data. (1 mark)
b Suggest an alternative method for selecting the sample which would make the use of a paired 
t-test valid. (1 mark) 
c Suggest two other factors that the farmer might consider when selecting the 
sample. (2 marks) 


The following paired data were collected. 


Diet A | 5 | 6   | 7 | 4.6 | 6.1 | 5.7 | 6.2 | 7.4 | 5   | 3
Diet B | 7 | 7.2 | 8 | 6.4 | 5.1 | 7.9 | 8.2 | 6.2 | 6.1 | 5.8


d Using a paired t-test at the 5% significance level, test whether or not there is evidence of 
a difference in the weight gained by the lambs using diet A compared with those using 


diet B. (7 marks) 
e State, giving a reason, which diet you would recommend the farmer to use for his 
lambs. (1 mark) 


A medical student is investigating two methods of taking a person’s blood pressure. He takes a 
random sample of 10 people and measures their blood pressure using an arm cuff and a finger 
monitor. The table below shows the blood pressure for each person, measured by each method. 
Person         |   A |   B |   C |   D |   E |   F |   G |   H |   I |   J
Arm cuff       | 140 | 110 | 138 | 127 | 142 | 112 | 122 | 128 | 132 | 160
Finger monitor | 154 | 112 | 156 | 152 | 142 | 104 | 126 | 132 | 144 | 180


a Use a paired t-test to determine, at the 10% level of significance, whether or not there 
is a difference in the mean blood pressure measured using the two methods. State your 
hypotheses clearly. (7 marks) 


b State an assumption about the underlying distribution of measured blood pressure 
required for this test. (1 mark) 


The weights, in grams, of mice are normally distributed. A biologist takes a random sample of
10 mice. She weighs each mouse and records its weight. 


The 10 mice are then fed on a special diet. They are weighed again after two weeks. 


Their weights in grams are as follows: 


Mouse              |    A |    B |    C |    D |    E |    F |    G |    H |    I |    J
Weight before diet | 50.0 | 48.3 | 47.5 | 54.0 | 38.9 | 42.7 | 50.1 | 46.8 | 40.3 | 41.2
Weight after diet  | 52.1 | 47.6 | 50.1 | 52.3 | 42.2 | 44.3 | 51.8 | 48.0 | 41.9 | 43.6


Stating your hypotheses clearly, and using a 1% level of significance, test whether or not the 
diet causes an increase in the mean weight of the mice. (7 marks) 


A hospital department installed a new, more sophisticated, piece of equipment to replace an 
ageing one in order to speed up the treatment of patients. The treatment times of random 
samples of patients during the last week of operation of the old equipment and during the first 
week of operation of the new equipment were recorded. The summary results, in minutes, are 
shown in the table. 


              |  n | $\sum x$ | $\sum x^2$
Old equipment | 10 | 225      | 5136.3
New equipment |  9 | 234      | 6200.0

a Show that the values of $s^2$ for the old and new equipment are 8.2 and 14.5 respectively. (2 marks)

Stating clearly your hypotheses, test: 


b whether the variance of the times using the new equipment is greater than the variance of 
the times using the old equipment, using a 5% significance level (6 marks) 


c whether there is a difference between the mean times for treatment using the new 
equipment and the old equipment, using a 2% significance level. (6 marks) 


d Find 95% confidence limits for the mean difference in treatment times between the new 
and old equipment. (5 marks) 


Even if the new equipment would eventually lead to a reduction in treatment times, it might be 
that to begin with treatment times using the new equipment would be higher than those using 
the old equipment. 


e Give one reason why this might be so. (1 mark) 
f Suggest how the comparison between the old and new equipment could be improved. (1 mark)
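
The two values of s² quoted in part a can be checked numerically. The Python sketch below is illustrative only: it assumes the usual unbiased estimator, (Σx² − (Σx)²/n)/(n − 1), and the function name unbiased_variance is a hypothetical helper rather than anything defined in this book.

```python
def unbiased_variance(n, sum_x, sum_x2):
    """Unbiased estimate of the population variance from summary statistics."""
    # s^2 = (sum of squares - (sum of values)^2 / n) / (n - 1)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


# Summary statistics taken from the table (treatment times in minutes).
s2_old = unbiased_variance(10, 225, 5136.3)
s2_new = unbiased_variance(9, 234, 6200.0)

print(round(s2_old, 1), round(s2_new, 1))
```

Rounded to one decimal place, the two results agree with the 8.2 and 14.5 stated in part a.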

Two different drugs designed to increase the red blood cell count are administered to two
groups of patients. A sample of 25 patients who took the first drug, A, is taken and the red
blood cell count, in million cells per microlitre, is recorded.

A sample, of size 19, is then taken from the patients who took drug B. The table shows the
mean red blood cell count and unbiased estimates for the population variance in each case.

       | Mean | Variance estimate |  n
Drug A | 5.9  | 2.6               | 25
Drug B | 4.8  | 1.7               | 19


a Stating your hypotheses clearly, test, at the 10% level of significance, the hypothesis 
that there is a difference in the variability of the red blood cell counts. 
State any assumptions you make. (5 marks) 

Doctors wish to find a confidence interval for the difference in the mean red blood cell counts
for the two drugs.

b With reference to your answer to part a, comment on the suitability of using a t-distribution
to find this confidence interval. (1 mark)
c Find, correct to 3 significant figures, a 95% confidence interval for the difference in the mean
red blood cell count. (5 marks)


Challenge 


Three independent random samples of sizes $n_x$, $n_y$ and $n_z$ (where $n_x, n_y, n_z > 1$)
are taken from three populations X, Y and Z respectively, where X, Y and Z
have equal but unknown variance $\sigma^2$.

a Given that the sample variances $S_x^2$, $S_y^2$ and $S_z^2$ are unbiased estimators
for $\sigma^2$, find a pooled estimator $S_p^2$ for $\sigma^2$ based on all three samples, giving
your answer in terms of $n_x$, $n_y$, $n_z$, $S_x^2$, $S_y^2$ and $S_z^2$.

b Show that the estimator found in part a is unbiased. 


Summary of key points 


1 If a random sample $X_1, X_2, \ldots, X_n$ is selected from a normal distribution with mean $\mu$ and
unknown variance $\sigma^2$ then

$$\frac{\bar{X} - \mu}{S / \sqrt{n}}$$

has a $t_{n-1}$-distribution where $S^2$ is an unbiased estimator of $\sigma^2$.


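2 In general, for a small sample of size n from a normal distribution $N(\mu, \sigma^2)$ with unknown mean
and variance:

* the $100(1 - \alpha)\%$ confidence limits for the population mean are

$$\bar{x} \pm t_{n-1}\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt{n}}$$

* the $100(1 - \alpha)\%$ confidence interval for the population mean is

$$\left(\bar{x} - t_{n-1}\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1}\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt{n}}\right)$$
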
3 In a paired experiment with a mean of the differences between the samples of $\bar{D}$,

$$\frac{\bar{D} - \mu_D}{S / \sqrt{n}} \sim t_{n-1}$$
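
Key points 1 to 3 can be tried out numerically. The following Python sketch is illustrative and makes some assumptions: the critical value is read from t-distribution tables and passed in by the caller, the function names t_interval and paired_t_statistic are hypothetical, and the data are made up for the example rather than taken from any exercise.

```python
from math import sqrt
from statistics import mean, stdev


def t_interval(xbar, s, n, t_crit):
    """Confidence interval xbar +/- t_crit * s / sqrt(n).

    t_crit is the tabulated t value with n - 1 degrees of freedom
    for the chosen confidence level; the caller supplies it.
    """
    half_width = t_crit * s / sqrt(n)
    return xbar - half_width, xbar + half_width


def paired_t_statistic(before, after, mu_d=0.0):
    """Test statistic (dbar - mu_d) / (s_d / sqrt(n)) for paired data."""
    d = [a - b for a, b in zip(after, before)]
    # stdev uses the n - 1 divisor, so its square is the unbiased variance estimate.
    return (mean(d) - mu_d) / (stdev(d) / sqrt(len(d)))


# Made-up values: n = 16, xbar = 50, s = 4, and t_15(0.025) = 2.131 from tables.
low, high = t_interval(50.0, 4.0, 16, 2.131)
print(round(low, 3), round(high, 3))

# Made-up paired data whose differences are 1, 2 and 3.
t_stat = paired_t_statistic([10, 10, 10], [11, 12, 13])
print(round(t_stat, 3))
```

Passing the critical value in as an argument mirrors the way the tables are used in the worked examples: the calculation of the limits is separate from looking up the percentage point.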

4 If a random sample of $n_x$ observations is taken from a normal distribution with unknown
variance $\sigma^2$, and an independent sample of $n_y$ observations is taken from a normal distribution
that also has unknown variance $\sigma^2$, then a pooled estimate for $\sigma^2$ is

$$s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}$$

where $s_x^2 = \frac{\sum x^2 - n_x\bar{x}^2}{n_x - 1}$ and $s_y^2 = \frac{\sum y^2 - n_y\bar{y}^2}{n_y - 1}$.


5 If a random sample of $n_x$ observations is taken from a normal distribution that has unknown
variance $\sigma^2$, and an independent sample of $n_y$ observations is taken from a normal distribution
with equal variance, then

$$\frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{S_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}} \sim t_{n_x + n_y - 2}$$

where $S_p^2 = \frac{(n_x - 1)S_x^2 + (n_y - 1)S_y^2}{n_x + n_y - 2}$.


6 The confidence limits for the difference between two means from independent normal
distributions, X and Y, when the variances are equal but unknown are given by

$$(\bar{x} - \bar{y}) \pm t_c s_p \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}$$

where $s_p^2$ is the pooled estimate of the population variance, and $t_c$ is the relevant value taken
from the t-distribution tables.

The confidence interval is given by

$$\left((\bar{x} - \bar{y}) - t_c s_p \sqrt{\frac{1}{n_x} + \frac{1}{n_y}},\ (\bar{x} - \bar{y}) + t_c s_p \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}\right)$$
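
The pooled-variance results in key points 4 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the critical value is taken from t-distribution tables and supplied by the caller, and the numbers are made up so that the arithmetic is easy to follow by hand.

```python
from math import sqrt


def pooled_variance(n_x, s2_x, n_y, s2_y):
    """Pooled estimate ((n_x - 1) s_x^2 + (n_y - 1) s_y^2) / (n_x + n_y - 2)."""
    return ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2)


def two_sample_t(xbar, ybar, n_x, n_y, s2_p, mu_diff=0.0):
    """Test statistic for a difference of means when the variances are equal."""
    return ((xbar - ybar) - mu_diff) / sqrt(s2_p * (1 / n_x + 1 / n_y))


def difference_interval(xbar, ybar, n_x, n_y, s2_p, t_crit):
    """Interval (xbar - ybar) +/- t_crit * s_p * sqrt(1/n_x + 1/n_y)."""
    half_width = t_crit * sqrt(s2_p * (1 / n_x + 1 / n_y))
    return (xbar - ybar) - half_width, (xbar - ybar) + half_width


# Made-up data: n_x = n_y = 10, s_x^2 = 4, s_y^2 = 6, xbar = 12, ybar = 10.
s2_p = pooled_variance(10, 4.0, 10, 6.0)         # (9*4 + 9*6) / 18
t_stat = two_sample_t(12.0, 10.0, 10, 10, s2_p)  # 2 / sqrt(s2_p * 0.2)
# t_18(0.025) = 2.101 from t-distribution tables (10 + 10 - 2 degrees of freedom).
low, high = difference_interval(12.0, 10.0, 10, 10, s2_p, 2.101)

print(s2_p, round(t_stat, 3), round(low, 3), round(high, 3))
```

The degrees of freedom used to look up the critical value are n_x + n_y − 2, matching the denominator of the pooled estimate.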

Review exercise

The time, in minutes, it takes Robert to complete the puzzle in his morning newspaper each day
is normally distributed with mean 18 and standard deviation 3. After taking a holiday, Robert
records the times taken to complete a random sample of 15 puzzles and he finds that the mean
time is 16.5 minutes. You may assume that the holiday has not changed the standard deviation
of times taken to complete the puzzle.

Stating your hypotheses clearly, test, at the 5% level of significance, whether or not there
has been a reduction in the mean time Robert takes to complete the puzzle. (6)
← SM2, Chapter 3

In a trial of diet A a random sample of 80 participants were asked to record their weight
loss, x kg, after their first week of using the diet. The results are summarised as follows:

$\sum x = 361.6$    $\sum x^2 = 1753.95$

a Find unbiased estimates for the mean and variance of weight lost after the first week of
using diet A. (3)

The designers of diet A believe it can achieve a greater mean weight loss after the first week
than an existing diet B. A random sample of 60 people used diet B. After the first week they
had achieved a mean weight loss of 4.06 kg, with an unbiased estimate of variance of weight
loss of 2.50 kg².

b Test, at the 5% level of significance, whether or not the mean weight loss after the first
week using diet A is greater than that using diet B. State your hypotheses clearly. (6)

c Explain the significance of the central limit theorem to the test in part b. (1)

d State an assumption you have made in carrying out the test in part b. (1)
← Sections 5.1, 5.4


A random sample of the daily sales (in £s) 
of a small company is taken and, using 
tables of the normal distribution, a 99% 
confidence interval for the mean daily 
sales is found to be (123.5, 154.7). 


Find a 95% confidence interval for the 
mean daily sales of the company. (6) 
← Section 5.2


A machine produces metal containers. 
The masses of the containers are 
normally distributed. A random sample 
of 10 containers was taken and the mass 
of each container was recorded to the 
nearest 0.1 kg. The results were as follows: 

49.7 50.3 51.0 49.5 49.9
50.1 50.2 50.0 49.6 49.7

a Find unbiased estimates of the mean and variance of the masses of the
population of metal containers. (3)

The machine is set to produce metal containers whose masses have a
population standard deviation of 0.5 kg.

b Find:
i a 95% confidence interval
ii a 99% confidence interval
for the population mean. (5)
← Sections 5.1, 5.2


5 The drying times of paint can be assumed to be normally distributed. A paint
manufacturer paints 10 test areas with a new paint. The following drying times, to
the nearest minute, were recorded:

82 98 140 110 90
125 150 130 70 110


a Calculate unbiased estimates for 
the mean and the variance of the 
population of drying times of this 
paint. (3) 

Given that the population standard deviation is 25,

b find a 95% confidence interval for the mean drying time of this paint. (5)

Fifteen similar sets of tests are done and the 95% confidence interval is determined
for each set.

c Find the probability that all 15 of these confidence intervals contain the
population mean. (2)
← Sections 5.1, 5.2


Some biologists were studying a large 
group of wading birds. A random sample 
of 36 were measured and the wing length, 
xmm, of each wading bird was recorded. 
The results are summarised as follows: 


$\sum x = 6046$    $\sum x^2 = 1\,016\,338$


a Calculate unbiased estimates for the 
mean and the variance of the wing 
lengths of these birds. (3) 


Given that the standard deviation of the 
wing lengths of this particular type of 
bird is actually 5.1 mm, 


b find a 99% confidence interval for the 
mean wing length of the birds from 
this group. (3) 


← Sections 5.1, 5.2


A sociologist is studying how much junk food teenagers eat. A random sample of
100 female teenagers and an independent random sample of 200 male teenagers
were asked to estimate their weekly expenditure, x, in pounds, on junk food.
The results are summarised below.

                 |   n | $\bar{x}$ | s
Female teenagers | 100 | 5.48      | 3.62
Male teenagers   | 200 | 6.86      | 4.51

a Using a 5% significance level, test whether or not there is a difference
in the mean amounts spent on junk food by male teenagers and female
teenagers. State your hypotheses clearly. (7)
b Explain briefly the importance of the central limit theorem in this
problem. (1)
← Section 5.4


A computer company repairs large 
numbers of PCs and wants to estimate 
the mean time taken to repair a particular 
fault. Five repairs are chosen at random 
from the company’s records and the times 
taken, in seconds, are as follows: 


205 310 405 195 320 


a Calculate unbiased estimates of 
the mean and the variance of the 
population of repair times from which 
this sample has been taken. (3) 


It is known from previous results that the standard deviation of the repair
time for this fault is 100 seconds. The company manager wants to ensure that
there is a probability of at least 0.95 that the estimate of the population mean lies
within 20 seconds of its true value.

b Find the minimum sample size required. (5)
← Sections 5.1, 5.2


A random sample of 15 tomatoes is taken and the mass, x grams, of each tomato
is found. The results are summarised as follows:

$\sum x = 208$    $\sum x^2 = 2962$

a Assuming that the masses of the tomatoes are normally distributed,
calculate the 90% confidence interval for the variance $\sigma^2$ of the masses of the
tomatoes. (6)

b State with a reason whether or not the confidence interval supports the
assertion $\sigma = 3$. (1)
← Section 6.1


A machine is filling bottles of milk. A random sample of 16 bottles was taken
and the volume of milk in each bottle was measured and recorded. The volume of
milk in a bottle is normally distributed and the unbiased estimate of the variance, $s^2$,
of the volume of milk in a bottle is 0.003.

a Find a 95% confidence interval for the variance of the population of volumes
of milk from which the sample was taken. (6)

The machine should fill bottles so that the standard deviation of the volumes is
equal to 0.07.

b Comment on this with reference to your 95% confidence interval. (1)
← Section 6.1


(E/P) 11 A mechanic is required to change car 
tyres. An inspector timed a random 
sample of 20 tyre changes and calculated 
the unbiased estimate of the population 


variance to be 6.25 minutes².


a Test, at the 5% significance level, 
whether or not the standard deviation 
of the population of times taken by 
the mechanic is greater than 2 minutes. 
State your hypotheses clearly. (7) 


b State one assumption you have made 
in carrying out this test. (1) 
← Section 6.2


The random variable X has an
F-distribution with 10 and 12 degrees of
freedom. Find a and b such that
$P(a < X < b) = 0.90$. (3)
← Section 6.3


The random variable X has an
F-distribution with 8 and 12 degrees of
freedom.

Find P(1/[illegible] < X < 2.85). (3)
← Section 6.3


A beach is divided into two areas, A and 
B. A random sample of pebbles is taken 
from each of the two areas and the length 
of each pebble is measured. A sample 

of size 26 is taken from area A and the
unbiased estimate for the population
variance is $s_A^2 = 0.495$ mm². A sample
of size 25 is taken from area B and the
unbiased estimate for the population
variance is $s_B^2 = 1.04$ mm².


a Stating your hypotheses clearly, test, at 
the 10% significance level, whether or 
not there is a difference in variability 
of pebble length between area A and 
area B. (7) 

b State the assumption you have made 
about the populations of pebble lengths 
in order to carry out the test. (1) 

← Section 6.4


The masses, in grams, of apples are 
assumed to follow a normal 
distribution. 


The masses of apples sold by a supermarket have variance $\sigma_S^2$. A random
sample of 4 apples from the supermarket had the following masses:

114 110 119 123

a Find a 95% confidence interval for $\sigma_S^2$. (4)

The masses of apples sold on a market stall have variance $\sigma_M^2$. A second random
sample of 7 apples was taken from the market stall. The sample variance $s_M^2$ of
the apples was 318.8.

b Stating your hypotheses clearly, test, at the 1% level of significance, whether or
not there is evidence that $\sigma_M^2 > \sigma_S^2$. (7)
← Sections 6.1, 6.4


A nutritionist studied the levels of cholesterol, X mg/cm³, of male students
at a large college. She assumed that $X \sim N(\mu, \sigma^2)$ and examined a random
sample of 25 male students. Using this sample, she obtained the following
unbiased estimates for $\mu$ and $\sigma^2$:

$\hat{\mu} = 1.68$    $\hat{\sigma}^2 = 1.79$

a Using the t-distribution, find a 95% confidence interval for $\mu$. (6)
b Obtain a 95% confidence interval for $\sigma^2$. (6)

A cholesterol reading of more than 2.5 mg/cm³ is regarded as high.

c Use appropriate confidence limits from parts a and b to find the lowest estimate
of the proportion of male students in the college with high cholesterol. (2)
← Sections 6.1, 7.1


A doctor wishes to study the level of 
blood glucose in males. The level of blood 
glucose is normally distributed. The 
doctor measured the blood glucose of 
10 randomly selected male students from 
a school. The results, in mmol/litre, are 
given below. 
4.7 3.6 3.8 4.7 4.1
2.2 3.6 4.0 4.4 5.0
a Calculate a 95% confidence interval for 
the mean. (6) 
b Calculate a 95% confidence interval for 
the variance. (6) 
A blood glucose reading of more than 
7 mmol/litre is counted as high. 
c Use appropriate confidence limits 
from parts a and b to find the highest 
estimate of the proportion of male 
students in the school with a high 
blood glucose level. (2) 


← Sections 6.1, 7.1

A supervisor wishes to check the typing
speed of a new typist. On 10 randomly
selected occasions, the supervisor records
the time taken for the new typist to type
100 words. The results, in seconds, are
given below.

110 125 130 126 128
127 118 120 122 125


The supervisor assumes that the time taken 
to type 100 words is normally distributed. 


a Calculate a 95% confidence interval for 
i the mean 
ii the variance 
of the population of times taken by 
this typist to type 100 words. (12) 
The supervisor requires the average 
time needed to type 100 words to be no 
more than 130 seconds and the standard 
deviation to be no more than 4 seconds. 


b Comment on whether or not the 
supervisor should be concerned about 
the speed of the new typist. (2) 

← Sections 6.1, 7.1


The length, X mm, of a spring made by a machine is normally distributed with
$X \sim N(\mu, \sigma^2)$. A random sample of 20 springs is selected and their lengths
measured in mm. Using this sample, the unbiased estimates of $\mu$ and $\sigma^2$ are:

$\bar{x} = 100.6$    $s^2 = 1.5$
Stating your hypotheses clearly, test, at 
the 10% level of significance, 


a whether or not the variance of the lengths of springs is different from 0.9 (7)
b whether or not the mean length of the springs is greater than 100 mm. You
should use the t-distribution to carry out your test. (6)
← Sections 6.2, 7.2


A grocer receives deliveries of cauliflowers from two different growers, A
and B. The grocer takes random samples of cauliflowers from those supplied by
each grower. He measures the weight, x, in grams, of each cauliflower. The results
are summarised in the table below.

  | Sample size | $\sum x$ | $\sum x^2$
A | 11          | 6600     | 3 960 540
B | 13          | 9815     | 7 410 579

a Show, at the 10% significance level, that the variances of the populations
from which the samples are drawn can be assumed to be equal by testing
the hypothesis $H_0: \sigma_A^2 = \sigma_B^2$ against
hypothesis $H_1: \sigma_A^2 \neq \sigma_B^2$.

(You may assume that the two samples 
come from normal populations.) (7) 


The grocer believes that the mean weight 
of cauliflowers provided by B is at least 
150 g more than the mean weight of 
cauliflowers provided by A. 


b Use a 5% significance level to test the 
grocer’s belief. (6) 


c Justify your choice of test. (1) 
← Sections 6.3, 7.5


An educational researcher is testing 

the effectiveness of a new method of 
teaching a topic in mathematics. 

A random sample of 10 children were 
taught by the new method and a second 
random sample of 9 children, of similar 
age and ability, were taught by the 
conventional method. At the end of the 
teaching, the same test was given to both 
groups of children. 


The marks obtained by the two groups 
are summarised in the table below. 


                       | New method  | Conventional method
Mean ($\bar{x}$)       | 82.3        | 78.2
Standard deviation (s) | [illegible] | [illegible]
Number of students (n) | 10          | 9

a Stating your hypotheses clearly and using a 5% level of significance,
investigate whether or not
i the variance of the marks of children taught by the conventional
method is greater than that of children taught by the new method
ii the mean score of children taught by the conventional method is lower
than the mean score of those taught by the new method.

[In each case you should give full details of the calculation of the test
statistics.] (12)
b State any assumptions you made in order to carry out these tests. (2)

c Find a 95% confidence interval for the common variance of the marks of the
two groups. (5)
← Sections 6.3, 7.5


A large number of students are split into 
two groups, A and B. The students sit the 
same test but under different conditions. 
Group A has music playing in the room 
during the test, and group B has no music 
playing during the test. Small samples 

are then taken from each group and their 
marks recorded. The marks are normally 
distributed. 


The marks are as follows: 


Sample from group A

42 40 35 37 34 43 42 44 49 
Sample from group B 

40 44 38 47 38 37 33 


a Stating your hypotheses clearly, and 
using a 10% level of significance, test 
whether or not there is evidence of a 
difference between the variances of the 
marks of the two groups. (6) 


b State clearly an assumption you have made to enable you to carry out the
test in part a. (1)

c Use a two-tailed test, with a 5% level of significance, to determine whether
playing music during the test made any difference to the mean marks of
the two groups. State your hypotheses clearly. (6)

d Write down what you can conclude about the effect of music on a student’s
performance during the test. (1)
← Sections 6.3, 7.5


A company undertakes investigations 
to compare fuel consumption x, in 
miles per gallon, of two different cars, 
the Relaxant and the Elegane, with a 
view to purchasing a number of cars. 
A random sample of 13 Relaxants and 
an independent random sample of 7 
Eleganes were taken and the following 
statistics calculated. 


Car      | Sample size n | Sample mean $\bar{x}$ | Sample variance $s^2$
Relaxant | 13            | 32.31                 | 14.48
Elegane  | 7             | 28.43                 | 35.79


The company assumes that fuel 
consumption for each make of car 
follows a normal distribution. 


a Stating your hypotheses clearly test, at 
the 10% level of significance, whether 
or not the two distributions have the 
same variance. (6) 

b Stating your hypotheses clearly test, at 
the 5% level of significance, whether or 
not there is a difference in mean fuel 
consumption between the two types of 
car. (6) 


c Explain the importance of the 
conclusion to the test in part a in 
justifying the use of the test in part b. (2) 


d State two factors which might be 
considered when undertaking an 
investigation into fuel consumption of 
two models of car to ensure that a fair 
comparison is made. (2) 

← Sections 6.3, 7.5

A town council is concerned that the mean price of renting two-bedroom flats
in the town has exceeded £650 per month. A random sample of eight two-bedroom
flats had the following rent prices, x, in pounds per month.

705 640 560 680
800 620 580 760

[You may assume $\sum x = 5345$, $\sum x^2 = 3\,621\,025$]


a Find a 90% confidence interval for the 
mean price of renting a two-bedroom 
flat. (6) 


b State an assumption that is required for the validity of your interval in
part a. (1)


c Comment on whether or not the town 
council is justified in being concerned. 
Give a reason for your answer. (2) 

← Section 7.1
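
The two summary values that the question allows you to assume can be reproduced from the eight rents. The short Python check below is illustrative only; the variable names are hypothetical.

```python
# Rent prices, in pounds per month, for the eight sampled flats.
rents = [705, 640, 560, 680, 800, 620, 580, 760]

sum_x = sum(rents)                   # sum of the rents
sum_x2 = sum(x * x for x in rents)   # sum of the squared rents

print(sum_x, sum_x2)
```

Both totals agree with the figures given in the question, so either the raw data or the summary statistics can be used as the starting point for part a.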


Historical records from a large colony of 
squirrels show that the weight of squirrels 
is normally distributed with a mean of 

1012 g. Following a change in the diet of 
squirrels, a biologist is interested in whether 
or not the mean weight has changed. 


A random sample of 14 squirrels is weighed and their weights, x, in grams,
recorded. The results are summarised as follows:

$\sum x = 13\,700$    $\sum x^2 = 13\,448\,750$

Stating your hypotheses clearly, and using a suitable t-distribution, test, at the
5% level of significance, whether or not there has been a change in the mean
weight of the squirrels. (7)
← Section 7.2


A machine is set to fill bags with flour 
such that the mean weight is 1010 grams. 


To check that the machine is working properly, a random sample of 8 bags is
selected. The weight of flour, in grams, in each bag is as follows:

1010 1015 1005 1000
998 1008 1012 1007

Carry out a suitable test, at the 5% significance level, to determine whether or
not the mean weight of flour in the bags is less than 1010 grams. (You may assume
that the weight of flour delivered by the machine is normally distributed.) (7)
← Section 7.2


27 A doctor believes that the span of a person’s dominant hand is greater than
that of the weaker hand. To test this theory, the doctor measures the spans
of the dominant and weaker hands of a random sample of 8 people. The spans,
in mm, are summarised in the table below.

Person        |   A |   B |   C |   D |   E |   F |   G |   H
Dominant hand | 599 | 251 | 215 | 235 | 210 | 195 | 191 | 230
Weaker hand   | 195 | 249 | 218 | 234 | 211 | 197 | 181 | 225

Carry out a paired t-test, at the 5% significance level, to determine whether
the doctor’s belief is correct. (7)
← Section 7.3


28 A group of 10 technology students is 
assessed by coursework and a written 
examination. The marks, given as 
percentages, are shown in the table below. 


Student Coursework | Written exam 
1 65 61 
2 73 76 
3 62 65 
4 81 77 
5 78 72 
6 74 71 
7 68 72 
8 59 42 
9 76 69 

10 70 63 

a Use a suitable t-test to determine
whether or not the coursework marks
are significantly higher than the written
examination marks. Use a 5% level of
significance. (7)

b State an assumption about the
distribution of marks that is needed to
make the above test valid. (1)
← Section 7.3


An engineer decided to investigate 
whether or not the strength of rope was 
affected by water. A random sample of 9 
pieces of rope was taken and each piece 
was cut in half. One half of each piece 
was soaked in water over night, and then 
each piece of rope was tested to find its 
strength. The results, in coded units, are 
given in the table below. 


Rope number | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9
Dry rope    | 9.7 | 8.5 | 6.3 | 8.3 | 7.2 | 5.4 | 6.8 | 8.1 | 5.9
Wet rope    | 9.1 | 9.5 | 8.2 | 9.7 | 8.5 | 4.9 | 8.4 | 8.7 | 7.7


Assuming that the strength of rope
follows a normal distribution, test
whether or not there is any difference
between the mean strengths of dry and
wet rope. State your hypotheses clearly
and use a 1% level of significance. (7)
< Section 7.3


30 As part of an investigation into the
effectiveness of solar heating, a pair of 
houses was identified where the mean 
weekly fuel consumption was the same. 
One of the houses was then fitted with 
solar heating and the other was not. 
Following the fitting of the solar heating, 
a random sample of 9 weeks was taken. 
The table below shows the weekly fuel 
consumption for each house. 


Week                  | 1  | 2  | 3  | 4  | 5  | 6 | 7 | 8  | 9
Without solar heating | 19 | 19 | 18 | 14 | 6  | 7 | 5 | 31 | 43
With solar heating    | 13 | 99 | 11 | 16 | 14 | 1 | 0 | 20 | 38


a Stating your hypotheses clearly, test,
at the 5% level of significance, whether
or not there is evidence that the solar
heating reduces the mean weekly fuel
consumption. (7)

b State an assumption about weekly fuel
consumption that is required to carry
out this test. (1)

< Section 7.3


(E/P) 31 Two methods of extracting juice from an
orange are to be compared. Eight oranges 
are halved. One half of each orange 

is chosen at random and allocated to 
Method A and the other half is allocated
to Method B. The amounts of juice 
extracted, in ml, are given in the table. 


Orange   | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8
Method A | 29 | 30 | 26 | 25 | 26 | 22 | 23 | 28
Method B | 27 | 25 | 28 | 24 | 23 | 26 | 22 | 25


One statistician suggests performing a 
two-sample t-test to investigate whether
or not there is a difference between the 
mean amounts of juice extracted by the 
two methods. 

a Stating your hypotheses clearly and 
using a 5% significance level, carry out 
this test. 

(You may assume $\bar{x}_A = 26.125$, $s_A^2 = 7.84$, $\bar{x}_B = 25$, $s_B^2 = 4.00$ and $\sigma_A^2 = \sigma_B^2$.) (7)

Another statistician suggests analysing 

these data using a paired t-test. 

b Using a 5% significance level, carry out 
this test. (7) 

c State which of these two tests you 
consider to be more appropriate. 

Give a reason for your choice. (2) 
< Sections 7.3, 7.5 
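The contrast between the two suggested analyses of the orange data can be sketched as below. This is an illustration rather than a model answer; it assumes equal population variances for the two-sample statistic and normally distributed differences for the paired statistic, and all names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance

method_a = [29, 30, 26, 25, 26, 22, 23, 28]
method_b = [27, 25, 28, 24, 23, 26, 22, 25]
n = len(method_a)

# Two-sample (pooled) t statistic on 2n - 2 degrees of freedom.
pooled_var = ((n - 1) * variance(method_a) + (n - 1) * variance(method_b)) / (2 * n - 2)
t_two_sample = (mean(method_a) - mean(method_b)) / sqrt(pooled_var * (1 / n + 1 / n))

# Paired t statistic on n - 1 degrees of freedom, using the per-orange differences.
diffs = [a - b for a, b in zip(method_a, method_b)]
t_paired = mean(diffs) / (stdev(diffs) / sqrt(n))

print(f"two-sample t = {t_two_sample:.3f} (14 d.f.), paired t = {t_paired:.3f} (7 d.f.)")
```

The paired version removes orange-to-orange variation, since each orange contributes one half to each method.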


(E/P) 32 Brickland and Goodbrick are two
manufacturers of bricks. The lengths 
of the bricks produced by each 
manufacturer can be assumed to be 
normally distributed. A random sample 
of 20 bricks is taken from Brickland
and the length, x mm, of each brick is
recorded. The mean of this sample is 
207.1 mm and the variance is 3.2 mm².


a Calculate the 98% confidence interval 
for the mean length of brick from 
Brickland. (6) 

A random sample of 10 bricks is selected 

from those manufactured by Goodbrick. 

The length of each brick, y mm, is

recorded. The results are summarised as 

follows: 


$\sum y = 2046.2$, $\sum y^2 = 418\,785.4$


The variances of the length of brick for each 
manufacturer are assumed to be the same. 


b Find a 90% confidence interval for the 
value by which the mean length of brick 
made by Brickland exceeds the mean 
length of brick made by Goodbrick. (7) 

< Sections 7.1, 7.5 


Challenge 


1 A random sample of three independent
variables $X_1$, $X_2$ and $X_3$ is taken from a
distribution with mean $\mu$ and variance $\sigma^2$.

a Show that $\tfrac{2}{3}X_1 - \tfrac{1}{2}X_2 + \tfrac{5}{6}X_3$ is an unbiased
estimator for $\mu$.

An unbiased estimator for $\mu$ is given by
$\hat{\mu} = aX_1 + bX_2$, where a and b are constants.

b Show that $\mathrm{Var}(\hat{\mu}) = (2a^2 - 2a + 1)\sigma^2$.

c Hence determine the value of a and the
value of b for which $\hat{\mu}$ has minimum variance.

< Section 5.1 
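One possible line of reasoning for parts b and c is sketched below. It assumes the observations are independent, each with mean $\mu$ and variance $\sigma^2$, and that the estimator has the form $\hat{\mu} = aX_1 + bX_2$.

```latex
\begin{align*}
E(\hat{\mu}) &= aE(X_1) + bE(X_2) = (a + b)\mu = \mu
  \;\Rightarrow\; b = 1 - a, \\
\mathrm{Var}(\hat{\mu}) &= a^2\sigma^2 + b^2\sigma^2
  = \bigl(a^2 + (1 - a)^2\bigr)\sigma^2
  = (2a^2 - 2a + 1)\sigma^2, \\
\frac{d}{da}\bigl(2a^2 - 2a + 1\bigr) &= 4a - 2 = 0
  \;\Rightarrow\; a = \tfrac{1}{2},\quad b = \tfrac{1}{2}.
\end{align*}
```

The second derivative is $4 > 0$, so the stationary point is a minimum, and the resulting estimator is the mean of the two observations.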


2 The random variable X has the continuous
uniform distribution U[0, 1]. A random sample of
three independent observations $X_1$, $X_2$ and $X_3$ is
taken from X. The random variable M is defined
as the median of these three observations.


a Show that the probability density function of
M is given by:

$$h(x) = \begin{cases} 6x(1 - x) & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}$$


b Hence show that M is an unbiased estimator 
for the median of X. 


c Find the standard error of this estimator. 
< Sections 3.1, 5.1
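A sketch of one route through this challenge, assuming the density of the median is $6x(1 - x)$ on $(0, 1)$. The median is at most $x$ exactly when at least two of the three observations are at most $x$.

```latex
\begin{align*}
P(M \le x) &= 3x^2(1 - x) + x^3 = 3x^2 - 2x^3, \qquad 0 < x < 1, \\
h(x) &= \frac{d}{dx}\bigl(3x^2 - 2x^3\bigr) = 6x - 6x^2 = 6x(1 - x), \\
E(M) &= \int_0^1 6x^2(1 - x)\,dx = 2 - \tfrac{3}{2} = \tfrac{1}{2}, \\
E(M^2) &= \int_0^1 6x^3(1 - x)\,dx = \tfrac{3}{2} - \tfrac{6}{5} = \tfrac{3}{10}, \\
\mathrm{Var}(M) &= \tfrac{3}{10} - \tfrac{1}{4} = \tfrac{1}{20},
\qquad \text{standard error} = \sqrt{\tfrac{1}{20}} \approx 0.224.
\end{align*}
```

Since the median of U[0, 1] is $\tfrac{1}{2}$, the result $E(M) = \tfrac{1}{2}$ is what unbiasedness requires.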
Further Mathematics
AS Level 
Further Statistics 2 


Time: 50 minutes 
You must have: Mathematical Formulae and Statistical Tables, Calculator 


1 A continuous random variable X has a probability density function given by 


fa) = {5 l=x<2 

0 otherwise 

where k is a positive constant.

a Sketch f(x). (2) 
b State the mode of X. (1) 
c Show that k= = (2) 
d Use algebraic integration to find E(X). (3) 
e Define fully the cumulative distribution function F(x). (4) 
f Show that the median, m, of X satisfies the equation $2m^3 - 28m + 33 = 0$. (2)
Given that m = 1.357, 

g Comment on the skewness of the distribution of X. (1) 
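The quoted value of the median can be checked numerically. The sketch below assumes the median equation is the cubic $2m^3 - 28m + 33 = 0$ (a reading of a garbled exponent in part f) and uses simple bisection on the interval [1, 2], where the density is non-zero. The function name `g` is illustrative.

```python
def g(m: float) -> float:
    """Left-hand side of the assumed median equation 2m^3 - 28m + 33 = 0."""
    return 2 * m**3 - 28 * m + 33

# g(1) = 7 > 0 and g(2) = -7 < 0, so a root lies between 1 and 2.
lo, hi = 1.0, 2.0
for _ in range(60):
    mid = (lo + hi) / 2
    if g(lo) * g(mid) <= 0:
        hi = mid  # sign change in [lo, mid]
    else:
        lo = mid  # sign change in [mid, hi]

median = (lo + hi) / 2
print(f"median = {median:.3f}")
```

The root agrees with the value m = 1.357 given in the question.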


2 The time taken, in seconds, for the lift to arrive at Gladys’ floor of her hotel is modelled 
by the continuous random variable X, which is uniformly distributed over 0 < x < 40. 


a Write down the probability density function f(x). (1)
Find: 
b E(X) (1) 
c Var(X) (2)
d P(15 < X ≤ 30) (2)
e Given that Gladys has already spent 15 seconds waiting for the lift, find the probability 

that it will arrive in the next ten seconds. (2) 
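The standard results for a continuous uniform distribution on [a, b] can be applied as in the sketch below, which uses exact fractions. The variable names are illustrative, and part e is interpreted as P(X ≤ 25 | X > 15).

```python
from fractions import Fraction

a, b = Fraction(0), Fraction(40)  # X ~ U[0, 40], in seconds

density = 1 / (b - a)             # f(x) = 1/40 on the interval
expectation = (a + b) / 2         # E(X) = (a + b) / 2
variance = (b - a) ** 2 / 12      # Var(X) = (b - a)^2 / 12
p_15_to_30 = (30 - 15) * density  # probability is width times height

# Given 15 seconds have already passed, the lift arrives in the next 10 seconds
# when 15 < X <= 25; divide by P(X > 15).
p_next_ten = ((25 - 15) * density) / ((b - 15) * density)

print(expectation, variance, p_15_to_30, p_next_ten)
```

Conditioning on X > 15 leaves a uniform distribution on the remaining 25 seconds, which is why the last probability is 10/25.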
3 A group of 10 languages students sat tests in French and Spanish. Their results were as follows,
where x represents their French score and y their Spanish score.


x | 21 | 5  | 13 | 16 | 12 | 15 | 18 | 28 | 17 | 19
y | 40 | 31 | 23 | 27 | 26 | 38 | 21 | 47 | 33 | 30


a Calculate Spearman’s rank correlation coefficient for these data. (4) 


b Stating your hypotheses clearly, test, at the 5% level of significance, whether or not there is an 
association between French and Spanish scores. (4) 


The product-moment correlation coefficient for these results is 0.568. 


c Stating your hypotheses clearly, test, at the 5% level of significance, whether or not the 
correlation coefficient is greater than zero. (3) 


d State a reason why the conclusions of the two tests seem to conflict. (1) 
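Spearman's rank correlation coefficient for the French and Spanish scores can be sketched as follows. The helper `ranks` is hypothetical and assumes there are no tied values, which holds for these data.

```python
french = [21, 5, 13, 16, 12, 15, 18, 28, 17, 19]
spanish = [40, 31, 23, 27, 26, 38, 21, 47, 33, 30]

def ranks(values):
    """Rank each value from 1 (smallest) upwards; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

rank_x, rank_y = ranks(french), ranks(spanish)
n = len(french)

# r_s = 1 - 6 * sum(d^2) / (n(n^2 - 1)), where d is the difference in ranks
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
r_s = 1 - 6 * sum_d2 / (n * (n**2 - 1))

print(f"sum of d^2 = {sum_d2}, r_s = {r_s:.4f}")
```

The coefficient would then be compared with the tabulated critical value for n = 10.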


4 The owner of a riverside café is investigating the relationship between the daily mean 
temperature and the sales of ice-cream and coffee. He takes a random sample of 5 days in 
August and records the temperature, t (°C), and the sales of coffee, c (£100s). The data is shown
in the table below. 

t | 12 | 15  | 19  | 20  | 22
c | 7  | 6.8 | 6.5 | 6.1 | 5.9


Summary statistics are calculated and found to be

$S_{tt} = 65.2$, $S_{cc} = 0.852$, $\sum tc = 561.3$, $\sum t = 88$, $\sum c = 32.3$
He also collects data on sales of ice-cream on the same five days and calculates that the residual 
sum of squares (RSS) is 0.0524. 


Explain, with clear reasons, whether the data for the sales of ice-cream or the data for the sales 
of coffee is more likely to fit a linear model. (5) 
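The residual sum of squares for the coffee data can be computed from summary statistics, as sketched below. The values used are readings of a garbled line (in particular, the sum of c taken as 32.3 is an assumption), and the variable names are illustrative.

```python
n = 5
s_tt = 65.2     # assumed reading of the summary statistics
s_cc = 0.852
sum_tc = 561.3
sum_t = 88
sum_c = 32.3    # assumed reading

# S_tc = sum(tc) - sum(t) * sum(c) / n
s_tc = sum_tc - sum_t * sum_c / n

# RSS = S_cc - (S_tc)^2 / S_tt for the regression of c on t
rss_coffee = s_cc - s_tc**2 / s_tt
rss_ice_cream = 0.0524  # given in the question

print(f"S_tc = {s_tc:.2f}, RSS (coffee) = {rss_coffee:.4f}, RSS (ice-cream) = {rss_ice_cream}")
```

A smaller RSS indicates points lying closer to the fitted line, provided the two response variables are measured on comparable scales.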
Further Mathematics
A Level 
Further Statistics 2 


Time: 1 hour 30 minutes 
You must have: Mathematical Formulae and Statistical Tables, Calculator 


1 The weights of male warthogs are normally distributed with a mean of 90 kg and a standard 

deviation of 10kg. 
The weights of female warthogs are normally distributed with a mean of 60 kg and a standard 
deviation of 5 kg.
Given that the weights of male and female warthogs are independent, find the probability that: 
a 5 randomly chosen males and 2 randomly chosen females will weigh more than 560 kg in 

total, (5) 
b a randomly chosen male will weigh less than 1.4 times a randomly chosen female. (6)
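Both parts rely on the distribution of a linear combination of independent normal variables. A minimal sketch using the standard library is given below; it assumes the female standard deviation is 5 kg, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Part a: T = M1 + ... + M5 + F1 + F2, a sum of independent observations.
# Means add, and variances add.
total = NormalDist(mu=5 * 90 + 2 * 60, sigma=sqrt(5 * 10**2 + 2 * 5**2))
p_total_over_560 = 1 - total.cdf(560)

# Part b: D = M - 1.4F, so Var(D) = Var(M) + 1.4^2 * Var(F).
diff = NormalDist(mu=90 - 1.4 * 60, sigma=sqrt(10**2 + 1.4**2 * 5**2))
p_male_less = diff.cdf(0)  # P(M < 1.4F) = P(D < 0)

print(f"P(T > 560) = {p_total_over_560:.4f}, P(M < 1.4F) = {p_male_less:.4f}")
```

Note the difference between the two cases: a sum of two independent females has variance 2 × 25, whereas 1.4 times a single female has variance 1.4² × 25.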


2 A forestry worker is testing the effect of using a fertiliser on willow saplings. Two independent 
random samples of saplings are selected and their height gained over a 20-day period is 
recorded. One sample of 10 saplings is given the fertiliser while the other sample of 13 saplings 
is placed in an identical environment but without fertiliser. The heights gained (x cm) by both 
groups of saplings are summarised by the statistics in the table below. 


Sample size Mean x Standard deviation s 
With fertiliser 10 23.36 5.29 
Without fertiliser 13 19.96 6.84 


a Use a two-tailed test to show that, at the 10% level of significance, the variances of the heights 
gained by the saplings with and without fertiliser can be assumed to be equal. State your 


hypotheses clearly. (4) 
b Stating your hypotheses clearly, test, at the 5% level of significance, whether or not there is a

difference in the mean height gained by the two groups of saplings. (7) 
c State the importance of the test in a to your test in part b. (1) 
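The two test statistics needed here can be sketched from the summary table. The sketch assumes the tabulated standard deviations are the unbiased sample values, and the names are illustrative.

```python
from math import sqrt

n1, mean1, sd1 = 10, 23.36, 5.29  # with fertiliser
n2, mean2, sd2 = 13, 19.96, 6.84  # without fertiliser

# Part a: variance ratio, larger sample variance on top,
# compared with the F distribution on (12, 9) degrees of freedom.
f_ratio = max(sd1, sd2) ** 2 / min(sd1, sd2) ** 2

# Part b: pooled estimate of the common variance, then the two-sample t statistic
# on n1 + n2 - 2 degrees of freedom.
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
t_stat = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))

print(f"F = {f_ratio:.3f}, pooled variance = {pooled_var:.3f}, t = {t_stat:.3f}")
```

The pooled-variance t-test in part b is only valid if the equal-variance assumption checked in part a is reasonable.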


3 A doctor believes that the span of an adult male’s hand, in mm, is normally distributed with
a mean of $\mu$ mm and a standard deviation of $\sigma$ mm. A random sample of 6 men’s hands were
measured. Using this sample, she obtained unbiased estimates of $\mu$ and $\sigma^2$ as $\hat{\mu}$ and $\hat{\sigma}^2$.

A 95% confidence interval for $\mu$ was found to be (206.2, 223.5).


a Show that $\hat{\sigma}^2 = 67.9$ (correct to 3 significant figures). (4)

b Obtain a 95% confidence interval for $\sigma^2$. (3)

c Use appropriate confidence limits to find, to 2 decimal places, the highest estimate of the 
proportion of adult males with a hand span greater than 230mm. (4) 
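The variance estimate can be recovered from the width of the confidence interval, as in the sketch below. The critical values are hard-coded from standard tables (t with 5 degrees of freedom, and the chi-squared percentage points for 5 degrees of freedom); treating part b as an interval for the variance is an assumption.

```python
from math import sqrt

n = 6
lower, upper = 206.2, 223.5  # 95% confidence interval for the mean
t_crit = 2.571               # t table, 5 d.f., 2.5% in each tail

# The half-width equals t_crit * s / sqrt(n), so solve for s^2.
half_width = (upper - lower) / 2
var_hat = (half_width * sqrt(n) / t_crit) ** 2

# Interval for the variance: (n - 1)s^2 / chi-squared upper and lower points.
chi_lower, chi_upper = 0.831, 12.832  # chi-squared table, 5 d.f.
ci_variance = ((n - 1) * var_hat / chi_upper, (n - 1) * var_hat / chi_lower)

print(f"variance estimate = {var_hat:.1f}, interval = ({ci_variance[0]:.1f}, {ci_variance[1]:.1f})")
```

The upper limits for the mean and the standard deviation together give the largest plausible proportion above 230 mm in part c.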
4 A researcher thinks that there is a link between a person’s confidence and their height.
She devises a test to measure the confidence, c, of nine people and their height, h cm.
The data are shown in the table below. 


h 179 | 169 | 187 | 166 | 162 | 193 | 161 | 177 | 168 

c 569 | 561 | 579 | 561 | 540 | 598 | 542 | 565 | 573 
$\sum h = 1562$, $\sum c = 5088$, $\sum hc = 884\,484$, $\sum h^2 = 272\,094$, $\sum c^2 = 2\,878\,966$
a Find the values of $S_{hh}$, $S_{cc}$ and $S_{hc}$. (3)
b Calculate the equation of the regression line of c on h, giving your answer in the form

c = a + bh
where a and b are given to 3 significant figures. (2)

c Give an interpretation of the value of b. (1)
d Explain why it would not make sense to provide an interpretation of a. (1) 


The researcher decides to use this regression model to predict a person’s confidence but is told 
that one of the values for confidence is incorrectly recorded. 


e i Calculate the residual values. 


ii Hence identify the incorrect value, giving a reason for your answer. (3) 
f Ignoring the incorrect value, produce a new model. (2) 
g Use the new model to predict the confidence of a person who is 172 cm tall. (1) 
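The regression line and its residuals can be sketched directly from the data, as below. This is an illustration with hypothetical variable names; flagging the residual of largest magnitude is one reasonable way to spot a suspect value, not the only one.

```python
heights = [179, 169, 187, 166, 162, 193, 161, 177, 168]
confidence = [569, 561, 579, 561, 540, 598, 542, 565, 573]
n = len(heights)

# Summary statistics S_hh and S_hc
s_hh = sum(h * h for h in heights) - sum(heights) ** 2 / n
s_hc = sum(h * c for h, c in zip(heights, confidence)) - sum(heights) * sum(confidence) / n

# Regression line of c on h: c = a + b h
b = s_hc / s_hh
a = sum(confidence) / n - b * sum(heights) / n

# Residual = observed value - value predicted by the line
residuals = [c - (a + b * h) for h, c in zip(heights, confidence)]
suspect = max(range(n), key=lambda i: abs(residuals[i]))

print(f"c = {a:.1f} + {b:.3f}h")
print(f"largest residual {residuals[suspect]:.1f} at h = {heights[suspect]}, c = {confidence[suspect]}")
```

Refitting after removing the flagged pair uses the same calculation on the remaining eight points.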


5 The table shows the qualifying lap-times of eight drivers in an amateur motorbike race. 


Driver Amy Carl | Dhruv | David Ali Paula | Sarah | Jake 
Qualifying time (mm:ss) | 3:45 2:52 3:07 2:49 3:49 2:50 Oey | 3:11 


In the actual race, the drivers finished in the following order, with the quickest driver first: 
Carl, Paula, Sarah, David, Dhruv, Amy, Jake, Ali 


a Calculate Spearman’s rank correlation coefficient for these results. (4) 

b Stating your hypotheses clearly, test, at the 5% significance level, whether or not there is a 
positive correlation between qualifying lap-times and actual race results. (4) 

c Justify the use of Spearman’s rank correlation coefficient for this test. (1) 


In another race, qualifying was abandoned due to poor weather, resulting in four drivers being 

allocated a qualifying time of 3:00. 

d Briefly explain how you could calculate Spearman’s rank correlation coefficient in this 
situation. (2) 


6 A continuous random variable X has a cumulative distribution function given by

$$F(x) = \begin{cases} 0 & x < 0 \\ \tfrac{1}{2}x^3(x + 1) & 0 \le x \le 1 \\ 1 & x > 1 \end{cases}$$

a Use algebraic integration to find E(X) and Var(X). (5)
b Find the mode of X, giving a reason for your answer. (2)
c Describe the skewness of the distribution of X. Give a reason for your answer. (1)
k is a constant such that $0 < k \le \tfrac{1}{3}$.
d Show that $P(k < X < 3k) = k^3(40k + 13)$. (3)
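The moments in part a and the identity in part d can be checked with exact arithmetic. The sketch below assumes the cumulative distribution function is F(x) = ½x³(x + 1) on [0, 1], so that the density is f(x) = 2x³ + (3/2)x². The names `pdf_coeffs`, `moment` and `cdf` are illustrative.

```python
from fractions import Fraction

# Density f(x) = 2x^3 + (3/2)x^2 on [0, 1], stored as power -> coefficient.
pdf_coeffs = {3: Fraction(2), 2: Fraction(3, 2)}

def moment(k: int) -> Fraction:
    """E(X^k): integrate x^k f(x) term by term over [0, 1]."""
    return sum(c / (p + k + 1) for p, c in pdf_coeffs.items())

def cdf(x: Fraction) -> Fraction:
    """Assumed F(x) = (x^4 + x^3) / 2 for 0 <= x <= 1."""
    return (x**4 + x**3) / 2

mean = moment(1)
variance = moment(2) - mean**2

# Spot check of part d at k = 1/4: F(3k) - F(k) should equal k^3 (40k + 13).
k = Fraction(1, 4)
lhs = cdf(3 * k) - cdf(k)
rhs = k**3 * (40 * k + 13)

print(mean, variance, lhs, rhs)
```

A single value of k does not prove the identity, but it is a quick guard against algebra slips.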
7 A wooden pole AB is 6m long. The pole is sawn into two pieces at a randomly chosen point P.
The random variable X ~ U[0, 6] represents the length in metres of the section AP.


The two sections of the pole are used to form a framework for the cross section of a tent, with 
angle APB = 90°, as shown in the diagram below: 
[Diagram: triangle APB with the right angle at P and the base AB on the ground]


Find the expected area enclosed by the framework and the ground, shown shaded on the 
diagram. (6) 
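One way to set up the expectation is sketched below. It assumes the two sections, of lengths $X$ and $6 - X$, form the perpendicular sides of the right-angled triangle.

```latex
\begin{align*}
\text{Area} &= \tfrac{1}{2}X(6 - X) = 3X - \tfrac{1}{2}X^2, \\
E(X) &= 3, \qquad \mathrm{Var}(X) = \frac{(6 - 0)^2}{12} = 3, \qquad
E(X^2) = \mathrm{Var}(X) + \bigl(E(X)\bigr)^2 = 12, \\
E(\text{Area}) &= 3E(X) - \tfrac{1}{2}E(X^2) = 9 - 6 = 3 \text{ m}^2.
\end{align*}
```

The same value follows from integrating $\tfrac{1}{6}\bigl(3x - \tfrac{1}{2}x^2\bigr)$ over $0 \le x \le 6$.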
The tabulated value is P(X ≤ x), where X has a binomial distribution with index n and parameter p.


BINOMIAL CUMULATIVE DISTRIBUTION FUNCTION 
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1.0000 
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0.5905 


0.9185 
0.9914 
0.9995 
1.0000 


0.4437 


0.8352 
0.9734 
0.9978 
0.9999 


0.3277 


0.7373 
0.9421 
0.9933 
0.9997 


0.2373 


0.6328 
0.8965 
0.9844 
0.9990 


0.1681 


0.5282 
0.8369 
0.9692 
0.9976 


0.1160 


0.4284 
0.7648 
0.9460 
0.9947 


0.0778 


0.3370 
0.6826 
0.9130 
0.9898 


0.0503 


0.2562 
0.5931 
0.8688 
0.9815 


0.0312 


0.1875 
0.5000 
0.8125 
0.9688 


0.7351 


0.9672 
0.9978 
0.9999 
1.0000 
1.0000 


0.5314 


0.8857 
0.9842 
0.9987 
0.9999 
1.0000 


0.3771 


0.7765 
0.9527 
0.9941 
0.9996 
1.0000 


0.2621 


0.6554 
0.9011 
0.9830 
0.9984 
0.9999 


0.1780 


0.5339 
0.8306 
0.9624 
0.9954 
0.9998 


0.1176 


0.4202 
0.7443 
0.9295 
0.9891 
0.9993 


0.0754 


0.3191 
0.6471 
0.8826 
0.9777 
0.9982 


0.0467 


0.2333 
0.5443 
0.8208 
0.9590 
0.9959 


0.0277 


0.1636 
0.4415 
0.7447 
0.9308 
0.9917 


0.0156 


0.1094 
0.3438 
0.6563 
0.8906 
0.9844 


n=7,x= 
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0.9556 
0.9962 
0.9998 
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1.0000 


0.4783 


0.8503 
0.9743 
0.9973 
0.9998 
1.0000 


1.0000 


0.3206 


0.7166 
0.9262 
0.9879 
0.9988 
0.9999 


1.0000 


0.2097 


0.5767 
0.8520 
0.9667 
0.9953 
0.9996 


1.0000 


0.1335 


0.4449 
0.7564 
0.9294 
0.9871 
0.9987 


0.9999 


0.0824 


0.3294 
0.6471 
0.8740 
0.9712 
0.9962 


0.9998 


0.0490 


0.2338 
0.5323 
0.8002 
0.9444 
0.9910 
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0.0280 
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0.6083 
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0.9963 


0.0078 


0.0625 
0.2266 
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0.7734 
0.9375 


0.9922 
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0.6634 
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0.9942 
0.9996 
1.0000 
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1.0000 
1.0000 
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0.9950 
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1.0000 


1.0000 
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0.9971 
0.9998 


1.0000 
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0.1678 
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0.9437 
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0.9988 


0.9999 
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0.8862 
0.9727 
0.9958 


0.9996 
1.0000 


0.0576 


0.2553 
0.5518 
0.8059 
0.9420 
0.9887 


0.9987 
0.9999 


0.0319 


0.1691 
0.4278 
0.7064 
0.8939 
0.9747 


0.9964 
0.9998 


0.0168 


0.1064 
0.3154 
0.5941 
0.8263 
0.9502 


0.9915 
0.9993 


0.0084 


0.0632 
0.2201 
0.4770 
0.7396 
0.9115 


0.9819 
0.9983 


0.0039 


0.0352 
0.1445 
0.3633 
0.6367 
0.8555 


0.9648 
0.9961 




0.6302 


0.9288 
0.9916 
0.9994 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.3874 


0.7748 
0.9470 
0.9917 
0.9991 
0.9999 


1.0000 
1.0000 
1.0000 


0.2316 


0.5995 
0.8591 
0.9661 
0.9944 
0.9994 


1.0000 
1.0000 
1.0000 


0.1342 


0.4362 
0.7382 
0.9144 
0.9804 
0.9969 


0.9997 
1.0000 
1.0000 


0.0751 


0.3003 
0.6007 
0.8343 
0.9511 
0.9900 


0.9987 
0.9999 
1.0000 


0.0404 


0.1960 
0.4628 
0.7297 
0.9012 
0.9747 


0.9957 
0.9996 
1.0000 


0.0207 


0.1211 
0.3373 
0.6089 
0.8283 
0.9464 
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0.9986 
0.9999 


0.0101 


0.0705 
0.2318 
0.4826 
0.7334 
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0.9962 
0.9997 


0.0046 
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0.9992 


0.0020 
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0.9990 
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1.0000 
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0.9872 
0.9984 
0.9999 


1.0000 
1.0000 
1.0000 
1.0000 
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0.5443 
0.8202 
0.9500 
0.9901 
0.9986 


0.9999 
1.0000 
1.0000 
1.0000 


0.1074 


0.3758 
0.6778 
0.8791 
0.9672 
0.9936 


0.9991 
0.9999 
1.0000 
1.0000 


0.0563 


0.2440 
0.5256 
0.7759 
0.9219 
0.9803 


0.9965 
0.9996 
1.0000 
1.0000 


0.0282 


0.1493 
0.3828 
0.6496 
0.8497 
0.9527 


0.9894 
0.9984 
0.9999 
1.0000 


0.0135 


0.0860 
0.2616 
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0.9051 
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0.9952 
0.9995 
1.0000 
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0.0464 
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0.6331 
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0.9983 
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0.6230 
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0.9453 
0.9893 
0.9990 
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0.9996 
1.0000 
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0.0850 
0.2528 
0.4925 
0.7237 
0.8822 


0.9614 
0.9905 
0.9983 
0.9998 
1.0000 


1.0000 


0.0057 


0.0424 
0.1513 
0.3467 
0.5833 
0.7873 


0.9154 
0.9745 
0.9944 
0.9992 
0.9999 


1.0000 


0.0022 


0.0196 
0.0834 
0.2253 
0.4382 
0.6652 


0.8418 
0.9427 
0.9847 
0.9972 
0.9997 


1.0000 


0.0008 


0.0083 
0.0421 
0.1345 
0.3044 
0.5269 


0.7393 
0.8883 
0.9644 
0.9921 
0.9989 


0.9999 


0.0002 


0.0032 
0.0193 
0.0730 
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0.3872 


0.6128 
0.8062 
0.9270 
0.9807 
0.9968 


0.9998 
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0.4633 


0.8290 
0.9638 
0.9945 
0.9994 
0.9999 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
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0.5490 
0.8159 
0.9444 
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0.9978 


0.9997 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 


0.0874 


0.3186 
0.6042 
0.8227 
0.9383 
0.9832 


0.9964 
0.9994 
0.9999 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 


0.0352 


0.1671 
0.3980 
0.6482 
0.8358 
0.9389 


0.9819 
0.9958 
0.9992 
0.9999 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 


0.0134 


0.0802 
0.2361 
0.4613 
0.6865 
0.8516 


0.9434 
0.9827 
0.9958 
0.9992 
0.9999 


1.0000 
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1.0000 
1.0000 


0.0047 
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0.9963 
0.9993 


0.9999 
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0.0142 
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0.9578 
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0.9972 
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0.9999 
1.0000 
1.0000 
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0.0052 
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0.9050 
0.9662 
0.9907 
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0.9997 
1.0000 
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0.0107 
0.0424 
0.1204 
0.2608 


0.4522 
0.6535 
0.8182 
0.9231 
0.9745 


0.9937 
0.9989 
0.9999 
1.0000 


0.0000 


0.0005 
0.0037 
0.0176 
0.0592 
0.1509 


0.3036 
0.5000 
0.6964 
0.8491 
0.9408 


0.9824 
0.9963 
0.9995 
1.0000 


0.3585 


0.7358 
0.9245 
0.9841 
0.9974 
0.9997 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.1216 


0.3917 
0.6769 
0.8670 
0.9568 
0.9887 


0.9976 
0.9996 
0.9999 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0388 


0.1756 
0.4049 
0.6477 
0.8298 
0.9327 


0.9781 
0.9941 
0.9987 
0.9998 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0115 


0.0692 
0.2061 
0.4114 
0.6296 
0.8042 


0.9133 
0.9679 
0.9900 
0.9974 
0.9994 


0.9999 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0032 


0.0243 
0.0913 
0.2252 
0.4148 
0.6172 


0.7858 
0.8982 
0.9591 
0.9861 
0.9961 


0.9991 
0.9998 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0008 


0.0076 
0.0355 
0.1071 
0.2375 
0.4164 


0.6080 
0.7723 
0.8867 
0.9520 
0.9829 


0.9949 
0.9987 
0.9997 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0002 


0.0021 
0.0121 
0.0444 
0.1182 
0.2454 


0.4166 
0.6010 
0.7624 
0.8782 
0.9468 


0.9804 
0.9940 
0.9985 
0.9997 
1.0000 


1.0000 
1.0000 
1.0000 


0.0000 


0.0005 
0.0036 
0.0160 
0.0510 
0.1256 


0.2500 
0.4159 
0.5956 
0.7553 
0.8725 


0.9435 
0.9790 
0.9935 
0.9984 
0.9997 


1.0000 
1.0000 
1.0000 


0.0000 


0.0001 
0.0009 
0.0049 
0.0189 
0.0553 


0.1299 
0.2520 
0.4143 
0.5914 
0.7507 


0.8692 
0.9420 
0.9786 
0.9936 
0.9985 


0.9997 
1.0000 
1.0000 


0.0000 


0.0000 
0.0002 
0.0013 
0.0059 
0.0207 


0.0577 
0.1316 
0.2517 
0.4119 
0.5881 


0.7483 
0.8684 
0.9423 
0.9793 
0.9941 


0.9987 
0.9998 
1.0000 
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0.2774 


0.6424 
0.8729 
0.9659 
0.9928 
0.9988 


0.9998 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0718 


0.2712 
0.5371 
0.7636 
0.9020 
0.9666 


0.9905 
0.9977 
0.9995 
0.9999 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0172 


0.0931 
0.2537 
0.4711 
0.6821 
0.8385 


0.9305 
0.9745 
0.9920 
0.9979 
0.9995 


0.9999 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0038 


0.0274 
0.0982 
0.2340 
0.4207 
0.6167 


0.7800 
0.8909 
0.9532 
0.9827 
0.9944 


0.9985 
0.9996 
0.9999 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0008 


0.0070 
0.0321 
0.0962 
0.2137 
0.3783 


0.5611 
0.7265 
0.8506 
0.9287 
0.9703 


0.9893 
0.9966 
0.9991 
0.9998 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0001 


0.0016 
0.0090 
0.0332 
0.0905 
0.1935 


0.3407 
0.5118 
0.6769 
0.8106 
0.9022 


0.9558 
0.9825 
0.9940 
0.9982 
0.9995 


0.9999 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0000 


0.0003 
0.0021 
0.0097 
0.0320 
0.0826 


0.1734 
0.3061 
0.4668 
0.6303 
0.7712 


0.8746 
0.9396 
0.9745 
0.9907 
0.9971 


0.9992 
0.9998 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0000 


0.0001 
0.0004 
0.0024 
0.0095 
0.0294 


0.0736 
0.1536 
0.2735 
0.4246 
0.5858 


0.7323 
0.8462 
0.9222 
0.9656 
0.9868 


0.9957 
0.9988 
0.9997 
0.9999 
1.0000 


1.0000 
1.0000 


0.0000 


0.0000 
0.0001 
0.0005 
0.0023 
0.0086 


0.0258 
0.0639 
0.1340 
0.2424 
0.3843 


0.5426 
0.6937 
0.8173 
0.9040 
0.9560 


0.9826 
0.9942 
0.9984 
0.9996 
0.9999 


1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0001 
0.0005 
0.0020 


0.0073 
0.0216 
0.0539 
0.1148 
0.2122 


0.3450 
0.5000 
0.6550 
0.7878 
0.8852 


0.9461 
0.9784 
0.9927 
0.9980 
0.9995 


0.9999 
1.0000 


0.2146 


0.5535 
0.8122 
0.9392 
0.9844 
0.9967 


0.9994 
0.9999 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0424 


0.1837 
0.4114 
0.6474 
0.8245 
0.9268 


0.9742 
0.9922 
0.9980 
0.9995 
0.9999 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0076 


0.0480 
0.1514 
0.3217 
0.5245 
0.7106 


0.8474 
0.9302 
0.9722 
0.9903 
0.9971 


0.9992 
0.9998 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0012 


0.0105 
0.0442 
0.1227 
0.2552 
0.4275 


0.6070 
0.7608 
0.8713 
0.9389 
0.9744 


0.9905 
0.9969 
0.9991 
0.9998 
0.9999 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0002 


0.0020 
0.0106 
0.0374 
0.0979 
0.2026 


0.3481 
0.5143 
0.6736 
0.8034 
0.8943 


0.9493 
0.9784 
0.9918 
0.9973 
0.9992 


0.9998 
0.9999 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0000 


0.0003 
0.0021 
0.0093 
0.0302 
0.0766 


0.1595 
0.2814 
0.4315 
0.5888 
0.7304 


0.8407 
0.9155 
0.9599 
0.9831 
0.9936 


0.9979 
0.9994 
0.9998 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0000 


0.0000 
0.0003 
0.0019 
0.0075 
0.0233 


0.0586 
0.1238 
0.2247 
0.3575 
0.5078 


0.6548 
0.7802 
0.8737 
0.9348 
0.9699 


0.9876 
0.9955 
0.9986 
0.9996 
0.9999 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0003 
0.0015 
0.0057 


0.0172 
0.0435 
0.0940 
0.1763 
0.2915 


0.4311 
0.5785 
0.7145 
0.8246 
0.9029 


0.9519 
0.9788 
0.9917 
0.9971 
0.9991 


0.9998 
1.0000 
1.0000 
1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0000 
0.0002 
0.0011 


0.0040 
0.0121 
0.0312 
0.0694 
0.1350 


0.2327 
0.3592 
0.5025 
0.6448 
0.7691 


0.8644 
0.9286 
0.9666 
0.9862 
0.9950 


0.9984 
0.9996 
0.9999 
1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0000 
0.0000 
0.0002 


0.0007 
0.0026 
0.0081 
0.0214 
0.0494 


0.1002 
0.1808 
0.2923 
0.4278 
0.5722 


0.7077 
0.8192 
0.8998 
0.9506 
0.9786 


0.9919 
0.9974 
0.9993 
0.9998 
1.0000 
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0.1285 


0.3991 
0.6767 
0.8619 
0.9520 


0.9861 


0.9966 
0.9993 
0.9999 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0148 


0.0805 
0.2228 
0.4231 
0.6290 


0.7937 


0.9005 
0.9581 
0.9845 
0.9949 
0.9985 


0.9996 
0.9999 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0015 


0.0121 
0.0486 
0.1302 
0.2633 


0.4325 


0.6067 
0.7559 
0.8646 
0.9328 
0.9701 


0.9880 
0.9957 
0.9986 
0.9996 
0.9999 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0001 


0.0015 
0.0079 
0.0285 
0.0759 


0.1613 


0.2859 
0.4371 
0.5931 
0.7318 
0.8392 


0.9125 
0.9568 
0.9806 
0.9921 
0.9971 


0.9990 
0.9997 
0.9999 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0000 


0.0001 
0.0010 
0.0047 
0.0160 


0.0433 


0.0962 
0.1820 
0.2998 
0.4395 
0.5839 


0.7151 
0.8209 
0.8968 
0.9456 
0.9738 


0.9884 
0.9953 
0.9983 
0.9994 
0.9998 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0000 


0.0000 
0.0001 
0.0006 
0.0026 


0.0086 


0.0238 
0.0553 
0.1110 
0.1959 
0.3087 


0.4406 
0.5772 
0.7032 
0.8074 
0.8849 


0.9367 
0.9680 
0.9852 
0.9937 
0.9976 


0.9991 
0.9997 
0.9999 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0001 
0.0003 


0.0013 


0.0044 
0.0124 
0.0303 
0.0644 
0.1215 


0.2053 
0.3143 
0.4408 
0.5721 
0.6946 


0.7978 
0.8761 
0.9301 
0.9637 
0.9827 


0.9925 
0.9970 
0.9989 
0.9996 
0.9999 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0000 
0.0000 


0.0001 


0.0006 
0.0021 
0.0061 
0.0156 
0.0352 


0.0709 
0.1285 
0.2112 
0.3174 
0.4402 


0.5681 
0.6885 
0.7911 
0.8702 
0.9256 


0.9608 
0.9811 
0.9917 
0.9966 
0.9988 


0.9996 
0.9999 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0000 
0.0000 


0.0000 


0.0001 
0.0002 
0.0009 
0.0027 
0.0074 


0.0179 
0.0386 
0.0751 
0.1326 
0.2142 


0.3185 
0.4391 
0.5651 
0.6844 
0.7870 


0.8669 
0.9233 
0.9595 
0.9804 
0.9914 


0.9966 
0.9988 
0.9996 
0.9999 
1.0000 


1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0000 
0.0000 


0.0000 


0.0000 
0.0000 
0.0001 
0.0003 
0.0011 


0.0032 
0.0083 
0.0192 
0.0403 
0.0769 


0.1341 
0.2148 
0.3179 
0.4373 
0.5627 


0.6821 
0.7852 
0.8659 
0.9231 
0.9597 


0.9808 
0.9917 
0.9968 
0.9989 
0.9997 


0.9999 
1.0000 
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0.0769 


0.2794 
0.5405 
0.7604 
0.8964 
0.9622 


0.9882 
0.9968 
0.9992 
0.9998 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0052 


0.0338 
0.1117 
0.2503 
0.4312 
0.6161 


0.7702 
0.8779 
0.9421 
0.9755 
0.9906 


0.9968 
0.9990 
0.9997 
0.9999 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0003 


0.0029 
0.0142 
0.0460 
0.1121 
0.2194 


0.3613 
0.5188 
0.6681 
0.7911 
0.8801 


0.9372 
0.9699 
0.9868 
0.9947 
0.9981 


0.9993 
0.9998 
0.9999 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0000 


0.0002 
0.0013 
0.0057 
0.0185 
0.0480 


0.1034 
0.1904 
0.3073 
0.4437 
0.5836 


0.7107 
0.8139 
0.8894 
0.9393 
0.9692 


0.9856 
0.9937 
0.9975 
0.9991 
0.9997 


0.9999 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0000 


0.0000 
0.0001 
0.0005 
0.0021 
0.0070 


0.0194 
0.0453 
0.0916 
0.1637 
0.2622 


0.3816 
0.5110 
0.6370 
0.7481 
0.8369 


0.9017 
0.9449 
0.9713 
0.9861 
0.9937 


0.9974 
0.9990 
0.9996 
0.9999 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0000 
0.0002 
0.0007 


0.0025 
0.0073 
0.0183 
0.0402 
0.0789 


0.1390 
0.2229 
0.3279 
0.4468 
0.5692 


0.6839 
0.7822 
0.8594 
0.9152 
0.9522 


0.9749 
0.9877 
0.9944 
0.9976 
0.9991 


0.9997 
0.9999 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0000 
0.0000 
0.0001 


0.0002 
0.0008 
0.0025 
0.0067 
0.0160 


0.0342 
0.0661 
0.1163 
0.1878 
0.2801 


0.3889 
0.5060 
0.6216 
0.7264 
0.8139 


0.8813 
0.9290 
0.9604 
0.9793 
0.9900 


0.9955 
0.9981 
0.9993 
0.9997 
0.9999 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0000 
0.0000 
0.0000 


0.0000 
0.0001 
0.0002 
0.0008 
0.0022 


0.0057 
0.0133 
0.0280 
0.0540 
0.0955 


0.1561 
0.2369 
0.3356 
0.4465 
0.5610 


0.6701 
0.7660 
0.8438 
0.9022 
0.9427 


0.9686 
0.9840 
0.9924 
0.9966 
0.9986 


0.9995 
0.9998 
0.9999 
1.0000 
1.0000 


1.0000 
1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0000 
0.0000 
0.0000 


0.0000 
0.0000 
0.0000 
0.0001 
0.0002 


0.0006 
0.0018 
0.0045 
0.0104 
0.0220 


0.0427 
0.0765 
0.1273 
0.1974 
0.2862 


0.3900 
0.5019 
0.6134 
0.7160 
0.8034 


0.8721 
0.9220 
0.9556 
0.9765 
0.9884 


0.9947 
0.9978 
0.9991 
0.9997 
0.9999 


1.0000 
1.0000 
1.0000 


0.0000 


0.0000 
0.0000 
0.0000 
0.0000 
0.0000 


0.0000 
0.0000 
0.0000 
0.0000 
0.0000 


0.0000 
0.0002 
0.0005 
0.0013 
0.0033 


0.0077 
0.0164 
0.0325 
0.0595 
0.1013 


0.1611 
0.2399 
0.3359 
0.4439 
0.5561 


0.6641 
0.7601 
0.8389 
0.8987 
0.9405 


0.9675 
0.9836 
0.9923 
0.9967 
0.9987 


0.9995 
0.9998 
1.0000 


Appendix 


PERCENTAGE POINTS OF THE NORMAL DISTRIBUTION


The values z in the table are those which a random variable Z ~ N(0, 1) exceeds with probability p; that is,
$P(Z > z) = 1 - \Phi(z) = p$.


Pp z Pp z 
0.5000 0.0000 0.0500 1.6449 
0.4000 0.2533 0.0250 1.9600 
0.3000 0.5244 0.0100 2.3263 
0.2000 0.8416 0.0050 2.5758 
0.1500 1.0364 0.0010 3.0902 
0.1000 1.2816 0.0005 3.2905 
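
As a cross-check on the table above, the percentage points can be recomputed from the inverse of the standard normal distribution function. The sketch below uses Python's standard library; the function name `upper_percentage_point` is illustrative and not part of the source.

```python
from statistics import NormalDist


def upper_percentage_point(p: float) -> float:
    """Return z such that P(Z > z) = p for Z ~ N(0, 1)."""
    # P(Z > z) = p is the same as Phi(z) = 1 - p, so invert the c.d.f. at 1 - p.
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(1.0 - p)


# Recompute a few tabulated entries to 4 decimal places.
for p in (0.1000, 0.0500, 0.0250, 0.0050):
    print(p, round(upper_percentage_point(p), 4))
```

Values computed this way should agree with the tabulated entries to the four decimal places shown.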
PERCENTAGE POINTS OF THE χ² DISTRIBUTION

The values in the table are those which a random variable with the χ² distribution on ν degrees of freedom exceeds with the probability shown.

ν 0.995 0.990 0.975 0.950 0.900 0.100 0.050 0.025 0.010 0.005
1 0.000 0.000 0.001 0.004 0.016 2.705 3.841 5.024 6.635 7.879 
2 0.010 0.020 0.051 0.103 0.211 4.605 5.991 7.378 9.210 10.597 
3 0.072 0.115 0.216 0.352 0.584 6.251 7.815 9.348 11.345 12.838 
4 0.207 0.297 0.484 0.711 1.064 7.779 9.488 11.143 13.277 14.860
5 0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.832 15.086 16.750 
6 0.676 0.872 1.237 1.635 2.204 10.645 12.592 14.449 16.812 18.548
7 0.989 1.239 1.690 2.167 2.833 12.017 14.067 16.013 18.475 20.278 
8 1.344 1.646 2.180 2.733 3.490 13.362 15.507 17.535 20.090 21.955
9 1.735 2.088 2.700 3.325 4.168 14.684 16.919 19.023 21.666 23.589 
10 2.156 2.558 3.247 3.940 4.865 15.987 18.307 20.483 23.209 25.188 
11 2.603 3.053 3.816 4.575 5.580 17.275 19.675 21.920 24.725 26.757
12 3.074 3.571 4.404 5.226 6.304 18.549 21.026 23.337 26.217 28.300
13 3.565 4.107 5.009 5.892 7.042 19.812 22.362 24.736 27.688 29.819 
14 4.075 4.660 5.629 6.571 7.790 21.064 23.685 26.119 29.141 31.319
15 4.601 5.229 6.262 7.261 8.547 22.307 24.996 27.488 30.578 32.801 
16 5.142 5.812 6.908 7.962 9.312 23.542 26.296 28.845 32.000 34.267 
17 5.697 6.408 7.564 8.672 10.085 24.769 27.587 30.191 33.409 35.718
18 6.265 7.015 8.231 9.390 10.865 25.989 28.869 31.526 34.805 37.156 
19 6.844 7.633 8.907 10.117 11.651 27.204 30.144 32.852 36.191 38.582 
20 7.434 8.260 9.591 10.851 12.443 28.412 31.410 34.170 37.566 39.997 
21 8.034 8.897 10.283 11.591 13.240 29.615 32.671 35.479 38.932 41.401 
22 8.643 9.542 10.982 12.338 14.042 30.813 33.924 36.781 40.289 42.796 
23 9.260 10.196 11.689 13.091 14.848 32.007 35.172 38.076 41.638 44.181
24 9.886 10.856 12.401 13.848 15.659 33.196 36.415 39.364 42.980 45.558 
25 10.520 11.524 13.120 14.611 16.473 34.382 37.652 40.646 44.314 46.928
26 11.160 12.198 13.844 15.379 17.292 35.563 38.885 41.923 45.642 48.290
27 11.808 12.879 14.573 16.151 18.114 36.741 40.113 43.194 46.963 49.645
28 12.461 13.565 15.308 16.928 18.939 37.916 41.337 44.461 48.278 50.993
29 13.121 14.256 16.047 17.708 19.768 39.088 42.557 45.722 49.588 52.336
30 13.787 14.953 16.791 18.493 20.599 40.256 43.773 46.979 50.892 53.672
CRITICAL VALUES FOR CORRELATION COEFFICIENTS

These tables concern tests of the hypothesis that a population correlation coefficient ρ is 0. The values in the tables are the minimum values which need to be reached by a sample correlation coefficient in order to be significant at the level shown, on a one-tailed test.


Product Moment Coefficient Spearman’s Coefficient 
Level Sample Level 
0.10 0.05 0.025 0.01 0.005 Size, n 0.05 0.025 0.01 

0.8000 0.9000 0.9500 0.9800 0.9900 4 1.0000 - - 

0.6870 0.8054 0.8783 0.9343 0.9587 5 0.9000 1.0000 1.0000 
0.6084 0.7293 0.8114 0.8822 0.9172 6 0.8286 0.8857 0.9429 
0.5509 0.6694 0.7545 0.8329 0.8745 7 0.7143 0.7857 0.8929 
0.5067 0.6215 0.7067 0.7887 0.8343 8 0.6429 0.7381 0.8333 
0.4716 0.5822 0.6664 0.7498 0.7977 9 0.6000 0.7000 0.7833 
0.4428 0.5494 0.6319 0.7155 0.7646 10 0.5636 0.6485 0.7455 
0.4187 0.5214 0.6021 0.6851 0.7348 11 0.5364 0.6182 0.7091 
0.3981 0.4973 0.5760 0.6581 0.7079 12 0.5035 0.5874 0.6783 
0.3802 0.4762 0.5529 0.6339 0.6835 13 0.4835 0.5604 0.6484
0.3646 0.4575 0.5324 0.6120 0.6614 14 0.4637 0.5385 0.6264 
0.3507 0.4409 0.5140 0.5923 0.6411 15 0.4464 0.5214 0.6036 
0.3383 0.4259 0.4973 0.5742 0.6226 16 0.4294 0.5029 0.5824 
0.3271 0.4124 0.4821 0.5577 0.6055 17 0.4142 0.4877 0.5662 
0.3170 0.4000 0.4683 0.5425 0.5897 18 0.4014 0.4716 0.5501 
0.3077 0.3887 0.4555 0.5285 0.5751 19 0.3912 0.4596 0.5351 
0.2992 0.3783 0.4438 0.5155 0.5614 20 0.3805 0.4466 0.5218 
0.2914 0.3687 0.4329 0.5034 0.5487 21 0.3701 0.4364 0.5091 
0.2841 0.3598 0.4227 0.4921 0.5368 22 0.3608 0.4252 0.4975 
0.2774 0.3515 0.4133 0.4815 0.5256 23 0.3528 0.4160 0.4862 
0.2711 0.3438 0.4044 0.4716 0.5151 24 0.3443 0.4070 0.4757 
0.2653 0.3365 0.3961 0.4622 0.5052 25 0.3369 0.3977 0.4662 
0.2598 0.3297 0.3882 0.4534 0.4958 26 0.3306 0.3901 0.4571 
0.2546 0.3233 0.3809 0.4451 0.4869 27 0.3242 0.3828 0.4487 
0.2497 0.3172 0.3739 0.4372 0.4785 28 0.3180 0.3755 0.4401 
0.2451 0.3115 0.3673 0.4297 0.4705 29 0.3118 0.3685 0.4325 
0.2407 0.3061 0.3610 0.4226 0.4629 30 0.3063 0.3624 0.4251 
0.2070 0.2638 0.3120 0.3665 0.4026 40 0.2640 0.3128 0.3681 
0.1843 0.2353 0.2787 0.3281 0.3610 50 0.2353 0.2791 0.3293 
0.1678 0.2144 0.2542 0.2997 0.3301 60 0.2144 0.2545 0.3005 
0.1550 0.1982 0.2352 0.2776 0.3060 70 0.1982 0.2354 0.2782 
0.1448 0.1852 0.2199 0.2597 0.2864 80 0.1852 0.2201 0.2602 
0.1364 0.1745 0.2072 0.2449 0.2702 90 0.1745 0.2074 0.2453 
0.1292 0.1654 0.1966 0.2324 0.2565 100 0.1654 0.1967 0.2327 
PERCENTAGE POINTS OF STUDENT’S t-DISTRIBUTION

The values in the table are those which a random variable with Student’s t-distribution on ν degrees of freedom exceeds with the probability shown.

ν 0.10 0.05 0.025 0.01 0.005
1 3.078 6.314 12.706 31.821 63.657 
2 1.886 2.920 4.303 6.965 9.925 
3 1.638 2.353 3.182 4.541 5.841 
4 1.533 2.132 2.776 3.747 4.604 
5 1.476 2.015 2.571 3.365 4.032
6 1.440 1.943 2.447 3.143 3.707 
7 1.415 1.895 2.365 2.998 3.499 
8 1.397 1.860 2.306 2.896 3.355 
9 1.383 1.833 2.262 2.821 3.250 

10 1.372 1.812 2.228 2.764 3.169
11 1.363 1.796 2.201 2.718 3.106 
12 1.356 1.782 2.179 2.681 3.055 
13 1.350 1.771 2.160 2.650 3.012 
14 1.345 1.761 2.145 2.624 2.977 
15 1.341 1.753 2.131 2.602 2.947 
16 1.337 1.746 2.120 2.583 2.921 
17 1.333 1.740 2.110 2.567 2.898
18 1.330 1.734 2.101 2.552 2.878 
19 1.328 1.729 2.093 2.539 2.861 

20 1.325 1.725 2.086 2.528 2.845

21 1.323 1.721 2.080 2.518 2.831 

22 1.321 1.717 2.074 2.508 2.819 

23 1.319 1.714 2.069 2.500 2.807 

24 1.318 1.711 2.064 2.492 2.797 

25 1.316 1.708 2.060 2.485 2.787

26 1.315 1.706 2.056 2.479 2.779 

27 1.314 1.703 2.052 2.473 2.771 

28 1.313 1.701 2.048 2.467 2.763 

29 1.311 1.699 2.045 2.462 2.756 

30 1.310 1.697 2.042 2.457 2.750 

32 1.309 1.694 2.037 2.449 2.738 

34 1.307 1.691 2.032 2.441 2.728 

36 1.306 1.688 2.028 2.435 2.719 

38 1.304 1.686 2.024 2.429 2.712 

40 1.303 1.684 2.021 2.423 2.704 

45 1.301 1.679 2.014 2.412 2.690 

50 1.299 1.676 2.009 2.403 2.678 

55 1.297 1.673 2.004 2.396 2.668 

60 1.296 1.671 2.000 2.390 2.660 

70 1.294 1.667 1.994 2.381 2.648 

80 1.292 1.664 1.990 2.374 2.639 

90 1.291 1.662 1.987 2.369 2.632 

100 1.290 1.660 1.984 2.364 2.626 

110 1.289 1.659 1.982 2.361 2.621 

120 1.289 1.658 1.980 2.358 2.617 
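
The product moment critical values tabulated earlier in this appendix can be cross-checked against the t-distribution percentage points above. Under the null hypothesis ρ = 0, and assuming bivariate normal data, the statistic r√(n - 2)/√(1 - r²) follows a t-distribution on n - 2 degrees of freedom, so a critical value of t converts to a critical value of r. The sketch below applies that conversion; the function name is illustrative only.

```python
from math import sqrt


def pmcc_critical_value(t_value: float, n: int) -> float:
    """Convert a t critical value on n - 2 degrees of freedom into
    the matching critical value of the sample correlation coefficient r."""
    # Rearranging t = r * sqrt(n - 2) / sqrt(1 - r^2) for r gives:
    return t_value / sqrt(t_value ** 2 + (n - 2))


# n = 10 uses the row for 8 degrees of freedom in the t table.
print(round(pmcc_critical_value(1.860, 10), 4))  # one-tailed 0.05 level
print(round(pmcc_critical_value(2.306, 10), 4))  # one-tailed 0.025 level
```

Because the t values are themselves rounded to three decimal places, the result may differ from the tabulated correlation value in the fourth decimal place.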
PERCENTAGE POINTS OF THE F-DISTRIBUTION

The values in the table are those which a random variable with the F-distribution on ν₁ and ν₂ degrees of freedom exceeds with probability 0.05 or 0.01.

Probability ν₂ \ ν₁ 1 2 3 4 5 6 8 10 12 24 ∞
1 161.4 199.5 215.7 224.6 230.2 234.0 238.9 241.9 243.9 249.1 254.3
2 18.51 19.00 19.16 19.25 19.30 19.33 19.37 19.40 19.41 19.46 19.50
3 10.13 9.55 9.28 9.12 9.01 8.94 8.85 8.79 8.74 8.64 8.53
4 7.71 6.94 6.59 6.39 6.26 6.16 6.04 5.96 5.91 5.77 5.63 
5 6.61 5.79 5.41 5.19 5.05 4.95 4.82 4.74 4.68 4.53 4.37 
6 5.99 5.14 4.76 4.53 4.39 4.28 4.15 4.06 4.00 3.84 3.67 
7 5.59 4.74 4.35 4.12 3.97 3.87 3.73 3.64 3.57 3.41 3.23 
8 5.32 4.46 4.07 3.84 3.69 3.58 3.44 3.35 3.28 3.12 2.93 
9 5.12 4.26 3.86 3.63 3.48 3.37 3.23 3.14 3.07 2.90 2.71 
10 4.96 4.10 3.71 3.48 3.33 3.22 3.07 2.98 2.91 2.74 2.54 
0.05 11 4.84 3.98 3.59 3.36 3.20 3.09 2.95 2.85 2.79 2.61 2.40 
12 4.75 3.89 3.49 3.26 3.11 3.00 2.85 2.75 2.69 2.51 2.30 
14 4.60 3.74 3.34 3.11 2.96 2.85 2.70 2.60 2.53 2.35 2.13 
16 4.49 3.63 3.24 3.01 2.85 2.74 2.59 2.49 2.42 2.24 2.01 
18 4.41 3.55 3.16 2.93 2.77 2.66 2.51 2.41 2.34 2.15 1.92
20 4.35 3.49 3.10 2.87 2.71 2.60 2.45 2.35 2.28 2.08 1.84
25 4.24 3.39 2.99 2.76 2.60 2.49 2.34 2.24 2.16 1.96 1.71
30 4.17 3.32 2.92 2.69 2.53 2.42 2.27 2.16 2.09 1.89 1.62 
40 4.08 3.23 2.84 2.61 2.45 2.34 2.18 2.08 2.00 1.79 1.51 
60 4.00 3.15 2.76 2.53 2.37 2.25 2.10 1.99 1.92 1.70 1.39
120 3.92 3.07 2.68 2.45 2.29 2.18 2.02 1.91 1.83 1.61 1.25 
∞ 3.84 3.00 2.60 2.37 2.21 2.10 1.94 1.83 1.75 1.52 1.00
1 4052. 5000. 5403. 5625. 5764. 5859. 5982. 6056. 6106. 6235. 6366.
2 98.50 99.00 99.17 99.25 99.30 99.33 99.37 99.40 99.42 99.46 99.50
3 34.12 30.82 29.46 28.71 28.24 27.91 27.49 27.23 27.05 26.60 26.13
4 21.20 18.00 16.69 15.98 15.52 15.21 14.80 14.55 14.37 13.93 13.45
5 16.26 13.27 12.06 11.39 10.97 10.67 10.29 10.05 9.89 9.47 9.02
6 13.70 10.90 9.78 9.15 8.75 8.47 8.10 7.87 7.72 7.31 6.88
7 12.20 9.55 8.45 7.85 7.46 7.19 6.84 6.62 6.47 6.07 5.65
8 11.30 8.65 7.59 7.01 6.63 6.37 6.03 5.81 5.67 5.28 4.86
9 10.60 8.02 6.99 6.42 6.06 5.80 5.47 5.26 5.11 4.73 4.31
10 10.00 7.56 6.55 5.99 5.64 5.39 5.06 4.85 4.71 4.33 3.91
0.01 11 9.65 7.21 6.22 5.67 5.32 5.07 4.74 4.54 4.40 4.02 3.60 
12 9.33 6.93 5.95 5.41 5.06 4.82 4.50 4.30 4.16 3.78 3.36 
14 8.86 6.51 5.56 5.04 4.70 4.46 4.14 3.94 3.80 3.43 3.00 
16 8.53 6.23 5.29 4.77 4.44 4.20 3.89 3.69 3.55 3.18 2.75
18 8.29 6.01 5.09 4.58 4.25 4.01 3.71 3.51 3.37 3.00 2.57 
20 8.10 5.85 4.94 4.43 4.10 3.87 3.56 3.37 3.23 2.86 2.42
25 7.77 5.57 4.68 4.18 3.86 3.63 3.32 3.13 2.99 2.62 2.17 
30 7.56 5.39 4.51 4.02 3.70 3.47 3.17 2.98 2.84 2.47 2.01 
40 7.31 5.18 4.31 3.83 3.51 3.29 2.99 2.80 2.66 2.29 1.80 
60 7.08 4.98 4.13 3.65 3.34 3.12 2.82 2.63 2.50 2.12 1.60 
120 6.85 4.79 3.95 3.48 3.17 2.96 2.66 2.47 2.34 1.95 1.38 
∞ 6.63 4.61 3.78 3.32 3.02 2.80 2.51 2.32 2.18 1.79 1.00

If an upper percentage point of the F-distribution on ν₁ and ν₂ degrees of freedom is f, then the corresponding lower percentage point of the F-distribution on ν₂ and ν₁ degrees of freedom is 1/f.


POISSON CUMULATIVE DISTRIBUTION FUNCTION 


The tabulated value is P(X ≤ x), where X has a Poisson distribution with parameter λ.
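
Entries of this kind can be recomputed directly by summing Poisson probabilities. The following is a minimal sketch using only the standard library; the function name `poisson_cdf` is an illustrative choice, not something defined in the source.

```python
from math import exp


def poisson_cdf(x: int, lam: float) -> float:
    """Return P(X <= x) for X ~ Po(lam)."""
    if x < 0:
        return 0.0
    term = exp(-lam)  # P(X = 0)
    total = term
    for k in range(1, x + 1):
        term *= lam / k  # P(X = k) = P(X = k - 1) * lam / k
        total += term
    return total


# Example: cumulative probabilities for lam = 1.0, rounded as in the tables.
for x in range(4):
    print(x, round(poisson_cdf(x, 1.0), 4))
```

Building each term from the previous one avoids computing large factorials and powers separately.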
0.6968
0.7916 
0.8645 
0.9165 
0.9513 


0.9730 
0.9857 
0.9928 
0.9965 
0.9984 


0.9993 
0.9997 
FORMULAE

PROBABILITY
$P(A') = 1 - P(A)$
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
$P(A \cap B) = P(A)\,P(B \mid A)$
$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A')\,P(A')}$

For independent events A and B,
$P(B \mid A) = P(B)$
$P(A \mid B) = P(A)$
$P(A \cap B) = P(A)\,P(B)$


STANDARD DEVIATION 
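Standard deviation = $\sqrt{\text{variance}}$
Interquartile range = IQR = $Q_3 - Q_1$

For a set of n values $x_1, x_2, \ldots, x_i, \ldots, x_n$

$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n}$

Standard deviation = $\sqrt{\dfrac{S_{xx}}{n}}$ or $\sqrt{\dfrac{\sum x^2}{n} - \bar{x}^2}$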


DISCRETE DISTRIBUTIONS 
For a discrete random variable X taking values $x_i$ with probabilities $P(X = x_i)$
Expectation (mean): $E(X) = \mu = \sum x_i P(X = x_i)$
Variance: $\mathrm{Var}(X) = \sigma^2 = \sum (x_i - \mu)^2 P(X = x_i) = \sum x_i^2 P(X = x_i) - \mu^2$
For a function g(X): $E(g(X)) = \sum g(x_i) P(X = x_i)$
The probability generating function of X is $G_X(t) = E(t^X)$ and
$E(X) = G'_X(1)$ and $\mathrm{Var}(X) = G''_X(1) + G'_X(1) - [G'_X(1)]^2$
For Z = X + Y, where X and Y are independent: $G_Z(t) = G_X(t) \times G_Y(t)$
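
The expectation and variance formulae above can be applied mechanically to any finite probability distribution. The sketch below does this with exact fractions for a fair six-sided die, which is an illustrative example chosen here and not one taken from the text.

```python
from fractions import Fraction


def expectation(pmf: dict) -> Fraction:
    """E(X) = sum of x * P(X = x) over all values x."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf: dict) -> Fraction:
    """Var(X) = sum of x^2 * P(X = x), minus the square of the mean."""
    mu = expectation(pmf)
    return sum(x * x * p for x, p in pmf.items()) - mu ** 2


# Hypothetical example: a fair die, each face with probability 1/6.
die = {x: Fraction(1, 6) for x in range(1, 7)}
print(expectation(die), variance(die))
```

Using `Fraction` keeps the arithmetic exact, so the results can be compared directly with hand calculations.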
Standard discrete distributions

Distribution of X | P(X = x) | Mean | Variance | P.G.F.
Binomial B(n, p) | $\binom{n}{x} p^x (1 - p)^{n - x}$ | $np$ | $np(1 - p)$ | $(1 - p + pt)^n$
Poisson Po(λ) | $e^{-\lambda} \dfrac{\lambda^x}{x!}$ | $\lambda$ | $\lambda$ | $e^{\lambda(t - 1)}$
Geometric Geo(p) on 1, 2, ... | $p(1 - p)^{x - 1}$ | $\dfrac{1}{p}$ | $\dfrac{1 - p}{p^2}$ | $\dfrac{pt}{1 - (1 - p)t}$
Negative binomial on r, r + 1, ... | $\binom{x - 1}{r - 1} p^r (1 - p)^{x - r}$ | $\dfrac{r}{p}$ | $\dfrac{r(1 - p)}{p^2}$ | $\left(\dfrac{pt}{1 - (1 - p)t}\right)^r$

CONTINUOUS DISTRIBUTIONS
For a continuous random variable X having probability density function f
Expectation (mean): $E(X) = \mu = \int x f(x)\,dx$
Variance: $\mathrm{Var}(X) = \sigma^2 = \int (x - \mu)^2 f(x)\,dx = \int x^2 f(x)\,dx - \mu^2$
For a function g(X): $E(g(X)) = \int g(x) f(x)\,dx$
Cumulative distribution function: $F(x_0) = P(X \le x_0) = \int_{-\infty}^{x_0} f(t)\,dt$

Standard continuous distributions

Distribution of X | p.d.f. | Mean | Variance
Normal N(μ, σ²) | $\dfrac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$ | $\mu$ | $\sigma^2$
Uniform (rectangular) on [a, b] | $\dfrac{1}{b - a}$ | $\dfrac{1}{2}(a + b)$ | $\dfrac{1}{12}(b - a)^2$
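CORRELATION AND REGRESSION

For a set of n pairs of values $(x_i, y_i)$
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n}$
$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \dfrac{(\sum y_i)^2}{n}$
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}$

The product moment correlation coefficient is:

$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left\{\sum (x_i - \bar{x})^2\right\}\left\{\sum (y_i - \bar{y})^2\right\}}} = \dfrac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sqrt{\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right)\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}}$

The regression coefficient of y on x is $b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

Least squares regression line of y on x is $y = a + bx$ where $a = \bar{y} - b\bar{x}$

Residual sum of squares (RSS) $= S_{yy} - \dfrac{(S_{xy})^2}{S_{xx}} = S_{yy}(1 - r^2)$

Spearman’s rank correlation coefficient is $r_s = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$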
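
The correlation and regression formulae can be turned into a few short helper functions that work from summary statistics. This is a sketch only: the function names are illustrative, and the sample inputs are chosen for illustration (the regression inputs mirror summary statistics of the kind quoted in the Chapter 1 answers).

```python
from math import sqrt


def regression_line(x_bar, y_bar, s_xx, s_xy):
    """Return (a, b) for the least squares line y = a + bx."""
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def pmcc(s_xx, s_yy, s_xy):
    """Product moment correlation coefficient r."""
    return s_xy / sqrt(s_xx * s_yy)


def residual_sum_of_squares(s_xx, s_yy, s_xy):
    """RSS = S_yy - (S_xy)^2 / S_xx."""
    return s_yy - s_xy ** 2 / s_xx


def spearman(sum_d_squared, n):
    """Spearman's rank correlation coefficient (no tied ranks assumed)."""
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


print(regression_line(2.5, 12, 5, 20))   # line through the mean point
print(round(spearman(10, 6), 3))         # assumes n = 6 pairs of ranks
```

The identity RSS = S_yy(1 - r²) gives a convenient consistency check between `pmcc` and `residual_sum_of_squares`.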


EXPECTATION ALGEBRA 

For independent random variables X and Y
E(XY) = E(X)E(Y) 
Var(aX + bY) = a? Var(X) + b? Var(Y) 




SAMPLING DISTRIBUTIONS 


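Tests for mean when σ is known

For a random sample $X_1, X_2, \ldots, X_n$ of n independent observations from a distribution having mean $\mu$ and variance $\sigma^2$

$\bar{X}$ is an unbiased estimator of $\mu$, with $\mathrm{Var}(\bar{X}) = \dfrac{\sigma^2}{n}$

$S^2$ is an unbiased estimator of $\sigma^2$, where $S^2 = \dfrac{\sum (X_i - \bar{X})^2}{n - 1}$

For a random sample of n observations from $N(\mu, \sigma^2)$

$\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$

For a random sample of $n_x$ observations from $N(\mu_x, \sigma_x^2)$ and, independently, a random sample of $n_y$ observations from $N(\mu_y, \sigma_y^2)$,

$\dfrac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n_x} + \dfrac{\sigma_y^2}{n_y}}} \sim N(0, 1)$

Tests for variance and mean when σ is not known

For a random sample of n observations from $N(\mu, \sigma^2)$

$\dfrac{(n - 1)S^2}{\sigma^2} \sim \chi^2_{n-1}$

$\dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$ (also valid in matched-pairs situations)

For a random sample of $n_x$ observations from $N(\mu_x, \sigma_x^2)$ and, independently, a random sample of $n_y$ observations from $N(\mu_y, \sigma_y^2)$,

$\dfrac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2} \sim F_{n_x - 1,\, n_y - 1}$

If $\sigma_x^2 = \sigma_y^2 = \sigma^2$ (unknown) then

$\dfrac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{S_p^2\left(\dfrac{1}{n_x} + \dfrac{1}{n_y}\right)}} \sim t_{n_x + n_y - 2}$ where $S_p^2 = \dfrac{(n_x - 1)S_x^2 + (n_y - 1)S_y^2}{n_x + n_y - 2}$

NON-PARAMETRIC TESTS
Goodness-of-fit tests and contingency tables: $\sum \dfrac{(O_i - E_i)^2}{E_i} \sim \chi^2_\nu$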
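
Two of the statistics on this page, the pooled variance estimate and the goodness-of-fit statistic, are simple enough to sketch in a few lines. The function names and the sample figures below are illustrative assumptions, not values from the text.

```python
def pooled_variance(n_x, s2_x, n_y, s2_y):
    """Pooled estimate of a common variance from two independent samples,
    weighting each sample variance by its degrees of freedom."""
    return ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2)


def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: three categories, each with expected frequency 20.
stat = chi_squared_statistic([10, 20, 30], [20, 20, 20])
print(stat)                                 # compare with a tabulated chi-squared value
print(pooled_variance(5, 4.0, 7, 10.0))
```

In a test, the computed statistic would be compared with the tabulated χ² percentage point for the appropriate number of degrees of freedom, for example 5.991 at the 5% level on 2 degrees of freedom.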
CHAPTER 1
Prior knowledge check 


1 


a Every minute the reaction produces 3.85 g more of 
a product. 
b i is reliable as 5 lies inside the range of the data. 
ii is unreliable as 10 lies outside the range of the 
data. 
iii is unreliable as the regression equation should
only be used to predict a value of p given t.


Exercise 1A 


1 a = -3, b = 6
2 y = -14 + 5.5x
3 y = 2x
4 a x̄ = 2.5, ȳ = 12, S_xx = 5, S_xy = 20
b y = 2 + 4x
5 a S_xx = 40.8, S_xy = 69.6 b y = -0.294 + 1.71x
6 a y = -59 + 57(6) = 283
b For each dexterity point, productivity increases by 57.
c i No, because this is extrapolation as it is outside
the range of data.
ii No, because this is extrapolation as it is outside
the range of data.
7 g = 1.50 + 1.44h
8 a p = 65.4 - 1.38w
b w = 47.4 - 0.72p
c The gradient of the second regression line is 
calculated using different summary statistics rather 
than just the reciprocal of the summary statistics 
used for the first regression line. 


d i The first one. ii The second one. 
9 a y = 78.0 - 0.294x
b [Scatter diagram of y against x, for x from 0 to 250, with the line y = 78.0 - 0.294x drawn]
c Model is not valid since data does not follow a
linear pattern.

10 a S_nn = 6486, S_np = 6344

b p=21.0+0.978n 

c £60100 (3 s.f.) 

d_ Reliable as 40 000 items lies inside the range of the 
data. 

11 a S_nn = 589.6, S_np = 1474

b p=20+2.5n 

c The increase in cost, in pounds, for every 
100 leaflets printed. 

d t>8 

12 a y = -0.07 + 1.45x

b Number of years protection per coat of paint. 

c Unreliable as 7 coats lies outside the range of the 
data. 


d 10.08 years

e i 0.4779 + 1.247x

ii 9.2 years (2 s.f.) 

iii The answer now uses interpolation not 
extrapolation and the number of data points 
has increased, which increases accuracy in 
prediction. 


Exercise 1B 


1 


au PWD 


y=6-x 

s=88+p 

y = 32 - 5.33x 

t=9+4+3s 

a y=3.54+0.5% b d=35+4 2.5¢ 

a Sy = 162.2, S. = 190.8; y = 7.87 + 0.850x (3 s.f.) 
b c= 22.3 + 2.13a (3 sf.) 

c £90.46 or £90.56 

a p=3.03+1.49v (3s-f.) b 10.1 tonnes (3 s.f.) 


Exercise 1C 


1 Residuals: 0.07, -0.496, 0.471, p - 20.728, 
-0.094 (3 s.f.) hence p = 20.8 (3 s.f.) 


2 


a -0.99, -1.025, 1.94, 1.595, -0.095, -1.475 
b 
A 
x 
x 
x > 
2 4 6 8 10 7 12 14 | 16 + 
x x 
x 

c No-residuals not randomly scattered about zero. 


sao 


egaomereanrs & 


fe>) 


(enc) 


0.8892, —0.2956, 0.5196, 0.9888, 0.2732, -5.2576, 
1.496, 0.9036, 0.7804, -0.3428 

The outlier is (7.2, 84) 

Yes: She may have just had a bad day; No: It could 
have been incorrectly recorded. 

p= 51.6 + 5.31¢ (3 s.f.) 

93% 

—-0.176, 0.234, 0.048, -—0.038, -0.124, a- 7.216, 
0.402, a = 6.87 

Yes: Residuals are randomly distributed about zero. 
The RSS measures the reasonableness of linear fit. 


0.949 (3 s.f.) 
Obstacle course — lower RSS 
0.100 (3 s.f.) b October — lower RSS 


y = 17.55 - 3.117" 

The value/price of the car brand new (£17 550) 
£11316 (nearest pound) 

0.0319, 0.0021, 0.1606, —0.2809, —0.0107, 
0.0946 

Suitable since residuals are close to zero and 
scattered about zero. 

0.1148 (4 d.p.) 

First sample since RSS is smaller. 
Answers


Challenge 
p= 13,q = 36 
Mixed exercise 1 
1a t¢=1.96+0.95s b 49.5 (3 s-f.) 
2 S,. = 6.429, Svy = 11.68 b y=0.554 4+ 1.82x 


a 
c 6.01cm (3 s.f.) 

3 a S,,= 16350, S,, = 210331 

b y= 224.5 + 12.86x 

e 1511 

d 255 

e This answer is unreliable since a Gross National 

Product of 3500 is a long way outside the range of 

the data. Also, the regression equation should only 

be used to estimate values of y given x. 


4a y=0.343 + 0.449% b ¢=2.344+0.224m 
c 4.6cm (2 s.f.) 
5 a S,,= 90.9, S,, = 190 b y=-1.82 + 2.09x 
ce p=25.5 + 2.09r d 71 (2s.f.) 
e This answer is reliable since 22 breaths per minute 
is within the range of the data. 
6 a 0.79kg is the average amount of food consumed in 


1 week by 1 hen. 


b 23.9kg (3 s.f.) c £47.59 
7a&e 
350 
300 
x 
B® 250 
n 
n 
Ey 
= 200 
150 
100 
100 125 150 175 200 225 250 
Body length (mm) 


b_ There appears to be a linear relationship between 

body length and body mass. 

w= -12.7 + 1.981 

y =-127 + 1.98% 

290 ¢ (2 s.f.). This is reliable since 210 mm is within 

the range of the data. 

g Water voles B and C were probably removed from 
the river since they are both underweight. Water 
vole A was probably left in the river since it is 
slightly overweight. 

@ Sy = 78, Sex = 148 

c w=816.2 + 210.8n 

e 100 items is a long way outside the range of the data. 

f -0.311, 0.108, 1.36, -0.946, -1.69, 0.527, 0.946 

g 

h 


mao 


b y=7.311 + 0.5270x 
d 5032kg 


-124, 43.4, 546, -378, -675, 211, 379 
They are related by the same code as that used for 
y in terms of w. 
a 0.133, 0.422, 0,-0.711, -0.422, 0.156, 0.445 
b No -residuals not randomly scattered about zero. 
ce 1.10 (3 s.f.) 
d The second one — lower RSS 


10 a S,,= 39.315, S,= -27.9225 
[= 8.48 - 0.710s 

3.2mm (2 s.f.) 

0.347 


11 


12 


iw 


-— oa 


aone 


0.37 

No - residuals go negative, positive, negative, so not 
randomly scattered. 

y = 1.61 + 1.25x 


(9, 11) 

i Ignore: recording error; Do not ignore: could be 
a ‘runt’. 

ii y= 1.14 41.38% 

iii 28.7 grams (3 s.f.) 

iv Unreliable as 20 days lies outside the range of 
the data. 

s = -0.215 + 1.09¢ 
-0.117 (3 s.f.)

Linear model is suitable since residuals are 

randomly scattered around zero. 

0.0516 (3 s.f.) 

The UK sample since RSS is smaller. 


Challenge 
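a $\sum \varepsilon_i = \sum (y_i - a - bx_i)$
$= \sum (y_i - (\bar{y} - b\bar{x}) - bx_i)$
$= \sum (y_i - \bar{y} + b\bar{x} - bx_i)$
$= \sum y_i - \bar{y}\sum 1 + b\bar{x}\sum 1 - b\sum x_i$
$= \sum y_i - n\bar{y} + nb\bar{x} - b\sum x_i$
$= \sum y_i - n\dfrac{\sum y_i}{n} + nb\dfrac{\sum x_i}{n} - b\sum x_i$
$= \sum y_i - \sum y_i + b\sum x_i - b\sum x_i$
$= 0$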


b_ The property doesn’t take into account if residuals 


are randomly scattered about 0 and are close to 0. It 
doesn’t take into account outliers and doesn’t mean RSS 
is close to 0. 
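
The result in part a can be illustrated numerically: whatever data are used, the residuals from a least squares line sum to zero up to rounding error. The data below are hypothetical and chosen only for illustration.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares regression line y = a + bx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data set.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(sum(residuals))  # zero apart from floating-point rounding
```

As part b points out, a zero sum says nothing about how large the individual residuals are, so the RSS is still needed to judge the fit.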


CHAPTER 2 
Prior knowledge check 
1 a Strong negative correlation 


2a 52 b 84 


b 


As the age of the car goes up, the value goes down. 
c 64 


Exercise 2A 

1 0.985 (3 sf) 
2 0.202 (3 s.f.) 
3 a 9.71 (3 s.f.) 


b 
c 


0.968 (3 s.f.) 
There is positive correlation. The greater the age, 
the taller the person. 


Online: Full worked solutions are available in SolutionBank.


5 a 
b 


NN 


a 


b 


anno mp 


10 a 


1l a 


Sy = 30.3, S, = 25.1, S, = 25.35 

0.919 (3 s.f.) 

The value of the correlation coefficient is close to 1 
and the points lie on an approximate straight line, 
therefore a linear regression model is suitable. 
0.866 (3 s.f.) 

There is positive correlation. The higher the IQ, the 
higher the mark in the general knowledge test. 


0.973 


p|o]}] 5] 3] 2/1 
g | 0 |17|12] 10] 6 


0.974 (3 s.f.) ce 0.974 (3 s.f.) 
Sop = 10, Sy = 5.2, Sp = 7 

0.971 (3 s.f.) 

0.971 (3 s.f.) 

Sez = 1601, Sy = 1282, S,y = -899 

-0.627 (3 s.f.) 

The shopkeeper is wrong. There is negative 
correlation. Sweet sales actually decrease as 
newspaper sales increase. 


spe ert 22 ome 


n 
2 
= 10033x? - a - 100( 32x : 
= 100S,, = 100 x 111.48 = 11148 
0.934 (3 s.f.) 
The PMCC suggests strong linear correlation but the 
scatter diagram suggests non-linear fit so a linear 
regression model is not suitable. 


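$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 22.02 - \dfrac{12^2}{7} = 1.448\ldots$

$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 180.37 - \dfrac{12 \times 97.7}{7} = 12.884\ldots$

$S_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n} = 1491.69 - \dfrac{97.7^2}{7} = 128.077\ldots$

$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \dfrac{12.884\ldots}{\sqrt{1.448\ldots \times 128.077\ldots}} = 0.946$ (3 s.f.)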
—2.29345, 0.22765, 0.8382, 1.16985, 1.39095, 
0.61205, -1.94575 
Residuals are not randomly scattered about zero 
(they ‘rise and fall’) so this indicates that a linear 
model is not a good model for this data. 
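
The product moment calculation in this answer can be reproduced from the summary totals alone. The sketch below assumes n = 7, Σx = 12, Σy = 97.7, Σx² = 22.02, Σy² = 1491.69 and Σxy = 180.37, as read from the working above; the function name is illustrative.

```python
from math import sqrt


def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product moment correlation coefficient from summary totals."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)


r = pmcc_from_sums(7, 12, 97.7, 22.02, 1491.69, 180.37)
print(round(r, 3))
```

A value this close to 1 does not by itself justify a linear model, which is why the residuals are examined as well.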


Exercise 2B 


la 
b 


The data clearly follows a linear trend. 
Spearman’s rank correlation coefficient is easier to 
calculate. 


2 The relationship is clearly non-linear. 

3 The number of attempts taken to score a free throw 
is not normally distributed (it is geometric) so the 
researcher should use Spearman’s rank correlation 
coefficient. 


4a 


b 


Yd? = 10, r, = 0.714... limited evidence of positive 
correlation between the pairs of ranks 

Yd? = 18, r, = 0.8909... evidence of positive 
correlation between the pairs of ranks 


an 
ae ao f 




Yd? = 158, r, = -0.8809.... evidence of negative 
correlation between the pairs of ranks 


1 b -1 
0.9 d 0.5 
Yd? = 48 


r, = 0.832.... The more goals a team scores, the 
higher they are likely to be in the league table. 


7 Yd? = 20, r, = 0.762... The trainee vet is doing quite 
well as there is a fair degree of agreement between the 
trainee vet and the qualified vet. The trainee still has 
more to learn as r, is less than 1. 


8a 


10 a 


The marks are discrete rather than continuous 

values. 

The marks are not normally distributed. 

Yd? = 28, r, = 0.8303... 

This shows a fairly strong positive correlation 

between the pairs of ranks of the marks awarded 

by the two judges so it appears they are judging the 

ice dances using similar criteria and with similar 

standards. 

Give each of the equal values a rank equal to the 

mean of the tied ranks. 

The emphasis here is on ranks/marks so the data 

sets are unlikely to be from a bivariate normal 

distribution. 

0.580 

Both show positive correlation but the judges agree 

more on the second dive. 

0.971 (3 s.f.) 

i No change since the rank does not change. 

ii Will increase since d = 0 and change in }-@? is 
zero but n increases. 

Use PMCC with tied ranks given mean of ranks. 


Exercise 2C 


la 
b 


PMCC = -0.975 (3 s.f.) 

Assume data are normally distributed. Critical 
values are +0.8745. -0.975 < —0.8745 so reject Hy. 
There is evidence of correlation. 

r = 0.677... 

Assume data are jointly normally distributed. 

Hy: p = 0; Hy: p > 0, 5% critical value is 0.5214. 
Reject Hy. There is evidence to suggest that the 
taller you are the older you are. 

Ho; ps = 0; Hy: ps 4 0. Critical region is r, < -—0.3624 
and r, > 0.3624 

Reject Hy. There is reason to believe that engine size 
and fuel consumption are related. 

Hy: p = 0, Hy: p > O 

Critical value = 0.7887 

Since 0.774 is not in the critical region there is 
insufficient evidence of positive correlation. 


$\sum d^2 = 10$
$r_s = 1 - \dfrac{6 \times 10}{8 \times 63} = 0.881$ (3 s.f.)


e.g. The data are discrete results in a limited range. 
They are judgements, not measurements. It is also 
unlikely that these scores will both be normally 
distributed. 

Ho: p = 0, Hy: p > O 

Critical value: 0.8333 

Since 0.8333 is in the critical region there is 
evidence of positive correlation. 


Ranks are given rather than raw data. 
b $\sum d^2 = 46$
$r_s = 1 - \dfrac{6 \times 46}{8 \times 63}$
$r_s = 0.452$

c Ho; p, = 0; H,: p, 4 0; critical values are + 0.6249 
0.452 < 0.6249 or not significant or insufficient 
evidence to reject Ho. There is no evidence of 
agreement between the two judges. 

6 Yd? = 76, r,= 0.357 (3 s.f.) Ho: ps = 0; Hy : p, < 0. 
Critical value = —0.8929. There is no reason to reject 
Ho. There is insufficient evidence to show that a team 
that scores a lot of goals concedes very few goals. 

7a Yd? =2,r,= 0.943 
b Ho: p, = 0; Hy: p, > 0. Critical value = 0.8286. 

Reject Hy. There is evidence that profits and takings 

are positively correlated. 

Vid? = 58, r, = 0.797... 

b Ho: p, = 0; Hy: p, 4 0. Critical values = +0.5874. 
Reject Hy: On this evidence it would seem that 
students who do well in mathematics are likely to 
do well in music. 

9 Using Spearman’s rank correlation coefficient: 

Vid? = 54, r, = 0.6727... 

Ho: p, = 0; H,: p, > 0. Critical value 0.5636. 

Reject Hy: the child shows some ability in this task. 

10 Using Spearman’s rank correlation coefficient: 

Vid? = 64, r, = -0.829 (3 d.p.) 

Ho: p; = 0; Hy: p, 4 0. Critical values = +0.8857. 

Do not reject Hy: There is insufficient evidence of 

correlation between crop yield and wetness. 

11 a PMCC since data is likely to be bivariate normal. 

b 2.5% ce 11 


Mixed exercise 2 
1a -0.147 (3 s-f.) b -0.147 (3 s.f.) 

c This is a weak negative correlation. There is little 
evidence to suggest that science marks are related 
to art marks. 

Sy= 4413, Spp = 5145, Sp = 3972 
0.834 (3 s.f.) 
c There is strong positive correlation, so Nimer is 


correct. 
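$S_{pp} = \sum p^2 - \dfrac{(\sum p)^2}{n} = \sum (x - 10)^2 - \dfrac{(\sum (x - 10))^2}{n}$

$= \sum (x^2 - 20x + 100) - \dfrac{(\sum x - 10n)^2}{n}$

$= \sum x^2 - 20\sum x + 100n - \dfrac{(\sum x)^2 - 20n\sum x + 100n^2}{n}$

$= \sum x^2 - 20\sum x + 100n - \left(\dfrac{(\sum x)^2}{n} - 20\sum x + 100n\right)$

$= \sum x^2 - \dfrac{(\sum x)^2}{n} = S_{xx}$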

b -0.964 (3 s.f.) c -0.964 (3 s.f.) 

d The PMCC suggets strong (negative) linear 
correlation but the scatter diagram suggests 
non-linear fit so a linear regression model is not 
suitable. 
10


11 


12 


13 


an f 


Data is given in ranks rather than raw scores. 

r,= 0.648 (3 s.f.) 

The null hypothesis is only rejected in favour 

of the alternative hypothesis if by doing so the 
probability of being wrong is less than or equal to 
the significance level. 

Ho: p; = 0; H,: p, > 0. Critical value at 5% is 0.5636, 
so reject Hy. There is evidence of agreement 
between the two judges. 

Data given in rank/place order. The populations are 
not jointly normal. The relationship between data 
sets is non-linear. 

yd? = 28, r, = 0.766... 

Ho: p = 0, Hy: p > 0, 2.5% critical value = 0.700 
Reject Hy. There is evidence of agreement between 
the tutors at the 2.5% significance level. At the 1% 
significance level the test statistic and critical value 
are very close so it is inconclusive at this level of 
significance. 

Sod? = 56, r, = 0.66 

Ho: p; = 0, Hy: p, > 0, critical value = 0.5636 

Reject Hy. There is a degree of agreement between 
the jumps. 


Ranking so use Spearman’s, r, = 0.714. Ho: p, = 0, 

H,: p, > 0. Critical value for 0.025 significance level = 
0.7857. Do not reject Ho. There is insufficient evidence 
that the expert can judge relative age accurately. 


a 
b 


i") 


b 


PMCC = 0.375 (3 s.f.) 

Ho: p = 0; Hy: p > 0. Critical value = 0.3783. 

Do not reject Hy. There is insufficient evidence of 
postive correlation between distance and time. 
Both distance and time are normally distributed. 
Yd? = 36, r, = 0.5714... 

Ho: p = 0, H,: p 4 0, critical values = +0.7381 

No reason to reject Ho. Students who do well in 
geography do not necessarily do well in statistics. 
0.7857 (4 d.p.) 

Critical values = +0.7381 

Reject Hy. There is evidence to suggest correlation 
between life expectancy and literacy. 

Only interested in order OR Cannot assume 
normality. 

i No change 

ii Would increase since d = 0 and n is bigger. 
Use PMCC with tied ranks given mean of those 
ranks. 

r, = 0.314 (3 s.f.) 

Ho: ps = 0; Hy: p, > 0. Critical value = 0.8286. 

Do not reject Hy. There is insufficient evidence of 
agreement between the rankings of the judge and 
the vet. 

Siz = 1038.1, Sy = 340.4, S,, = 202.2, 

r = 0.340 (3 d.p.) 

One or both given in rank order. 

Population is not normal. 

Relationship between data sets non-linear. 

Sid? = 112, r, = 0.321 (3 d.p.) 

Ho: ps = 0, Hy: p, 4 0, critical value = +0.6485 

Do not reject Hy. There is insufficient evidence of 
correlation. 

Syx = 6908.1, Sy, = 50 288.1, Sx = 17 462, 

r = 0.937 (3 d.p.) 

Sid’? = 4, r, = 0.976 (3 dp.) 




c Ho: p, = 0, Hy: p, 4 0, critical values = +0.6485 
Reject Hy. For this machine there is insufficient 
evidence of correlation between age and 
maintenance costs. 


14 a PMCC =-0.975 (3 s.f.) 
b Ho: p= 0, H;: p < 0, critical value = -0.5822, reject 
Hy. 
The greater the altitude the lower the temperature 
c Hp: p, = 0 (no association between hours of 
sunshine and temperature); H,: p, 4 0, critical value 
= +0.7000 
0.767 > 0.7000 so reject Hy. There is evidence of a 
positive association between hours of sunshine and 
temperature. 

15 a Youuse arank correlation coefficient if at least one 
of the sets of data isn’t from a normal distribution, 
or if at least one of the sets of data is already 
ranked. It is also used if there is a non-linear 
association between the two data sets. 

Sod? = 78, r, = 0.527... 

c Ho: p, = 0; Hy: p, > 0. Critical value = 0.5636. 
Do not reject Hy. There is insufficient evidence of 
agreement between the qualified judge and the 
trainee judge. 

16 Soa? = 120, r, = 0.4285.. 


There is only a small eee of positive correlation 
between league position and home attendance. 

17 a H,: p=0; H,: p > O. Critical value 0.5822. 
0.972 > 0.5822. Evidence to reject Hy. Age and 
weight are positively associated. 

b $\sum d^2 = 26$, $r_s = 1 - \dfrac{6 \times 26}{9 \times 80} = 0.783$ (3 s.f.)
c Critical value = 0.6000 
0.783 > 0.600 is evidence that actual weight and 
the boy’s guesses are associated. 


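Challenge
a Since there are no ties, both the $x_i$'s and the $y_i$'s consist of the integers from 1 to $n$, so
$$\sum x_i^2 = \sum y_i^2 = \sum_{r=1}^{n} r^2 = \frac{n(n+1)(2n+1)}{6}$$

b Since there are no ties, the $x_i$'s and $y_i$'s both consist of the integers from 1 to $n$. Hence
$$\sqrt{\sum (x_i-\bar{x})^2 \sum (y_i-\bar{y})^2} = \sqrt{\left(\sum (x_i-\bar{x})^2\right)^2} = \sum (x_i-\bar{x})^2$$
$$\bar{x} = \frac{\sum x_i}{n} = \frac{n(n+1)}{2n} = \frac{n+1}{2}$$
$$\sum (x_i-\bar{x})^2 = \sum (x_i^2 - 2x_i\bar{x} + \bar{x}^2) = \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2$$
$$= \frac{n(n+1)(2n+1)}{6} - \frac{2(n+1)}{2}\cdot\frac{n(n+1)}{2} + n\left(\frac{n+1}{2}\right)^2$$
$$= \frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)^2}{4} = \frac{n(n+1)(4n+2-3n-3)}{12}$$
$$= \frac{n(n+1)(n-1)}{12} = \frac{n(n^2-1)}{12}$$

c
$$\sum d_i^2 = \sum (y_i - x_i)^2 = \sum y_i^2 - 2\sum x_iy_i + \sum x_i^2 = 2\sum x_i^2 - 2\sum x_iy_i$$
$$\sum x_iy_i = \sum x_i^2 - \frac{\sum d_i^2}{2}$$

d
$$\sum (x_i-\bar{x})(y_i-\bar{y}) = \sum (x_iy_i - x_i\bar{y} - \bar{x}y_i + \bar{x}\bar{y})$$
$$= \sum x_iy_i - \bar{y}\sum x_i - \bar{x}\sum y_i + n\bar{x}\bar{y}$$
$$= \sum x_iy_i - n\bar{x}\bar{x} - n\bar{x}\bar{x} + n\bar{x}\bar{x} \quad (\text{since } \bar{y} = \bar{x})$$
$$= \sum x_iy_i - n\bar{x}^2 = \sum x_i^2 - \frac{\sum d_i^2}{2} - n\bar{x}^2$$
$$= \frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)(n+1)}{4} - \frac{\sum d_i^2}{2}$$
$$= \frac{n(n+1)(4n+2-3n-3)}{12} - \frac{\sum d_i^2}{2}$$
$$= \frac{n(n+1)(n-1)}{12} - \frac{\sum d_i^2}{2} = \frac{n(n^2-1)}{12} - \frac{\sum d_i^2}{2}$$

e From parts b and d
$$\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2 \sum (y_i-\bar{y})^2}} = \frac{\dfrac{n(n^2-1)}{12} - \dfrac{\sum d_i^2}{2}}{\dfrac{n(n^2-1)}{12}}$$
$$= 1 - \frac{\sum d_i^2 / 2}{n(n^2-1)/12} = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$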
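As a numerical sanity check on the identity derived in this Challenge, the sketch below compares the product moment correlation of two untied rankings with the Spearman formula. The rankings are randomly generated (hypothetical data, not from the book) and all function names are illustrative.

```python
import random
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def spearman_formula(xs, ys):
    """1 - 6*sum(d^2) / (n(n^2 - 1)), valid only when there are no ties."""
    n = len(xs)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

random.seed(1)
n = 8
x_ranks = list(range(1, n + 1))
y_ranks = random.sample(x_ranks, n)   # a random untied ranking

# Part b: the sum of squared deviations of the ranks is n(n^2 - 1)/12.
s_xx = sum((x - (n + 1) / 2) ** 2 for x in x_ranks)

r_pmcc = pmcc(x_ranks, y_ranks)
r_formula = spearman_formula(x_ranks, y_ranks)
```

Both routes should agree up to floating-point rounding; with tied ranks they generally would not, which is why the no-ties condition matters.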
CHAPTER 3 
Prior knowledge check 
1a k=3 b PX>5)=7% ¢ EX)=5 
2a 4 b 3 c 48 
3 a=6 


Exercise 3A 
1 a There are negative values for f(x) when x < 0 so 
this is not a probability density function. 
b Area of 84 not equal to 1 therefore it is not a 
probability density function. 
c There are negative values for f(x) so this is not a 
probability density function. 
2k=3 


3 a [Sketch of f(x)]


s fs fs & 


10 a k= 


lia k= 


12a 


230 


RY 


Challenge 
a k=2 
+ 8 1 
b a) Ul x00 
ce p=2.5 
Exercise 3B 
- 0 «<0 
xe 
F(x) = g «OS%S2 
1 MS 2 
2 0 x<1 
Pn a 
F(x) = aa 1<x<3 
1 £>3 
3 0 x <0 
2 
— 7 O<x<3 
2%: x 
1 x>6 
4a f(xa 
5k+------------- 
b k=$ 
. 0) x<0 
a 0<x<3 
F(x) = 
x 5x 
ae me 3<x<5 
1 go> 5 
5 f(x) =
  2x/5  2 ≤ x ≤ 3
  0     otherwise
6 a 0.75 b 0.75 ce 0.5 


7 (1/6)2^p + q = 0 (1) and (1/6)4^p + q = 1 (2)
(2) − (1): (1/6)4^p − (1/6)2^p = 1 ⇒ 4^p − 2^p = 6

Let y = 2^p, then y² − y − 6 = 0 ⇒ (y − 3)(y + 2) = 0

Taking the positive value, y = 3 ⇒ 2^p = 3

p = ln 3 / ln 2
From (1), q = −1/2
8 
7 3y2_o¢41 1<x<2 
f(x) = 2 2 
0 otherwise 
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3 
© 46 
I 
I 
I 
I 
2 43 7 
9a [4-29 dx =1=[i(4x-2)| 7 
0 
16k 3 
77° lsk= 
b 0 x<0 
Fa)= ¢ 3 [ay z) <x< 
3 (ae = ea 8 
1 a> 2 
c 0.007 (1 s.f.) 
10 F(x)>1forx>e 
11 a k=49 b 0.25 
12 F(x) =
  0            x < 1
  ln x / ln 7  1 ≤ x ≤ 7
  1            x > 7
0 G=< 0 9 
13 F(x) =(sin@@x) =O<x<} 
1 % >t 
1 
14 = 
+ he Fas 
b 1 | 1) 
f(x) = Sains\ & Lae 
0 otherwise 
Challenge 
0 t<0 
eer ee 
b 0.2044 (4 d.p.) c 0.0235 (4 d.p,) 10 
Exercise 3C 
3 3 3 
Ta k=; bs Ose 
2a 2.25 b 0.3375 ce 0.581 (3 s.f.) 
3a é b 3 c 0.943 (3 sf) 
d 0.556(3s.f) e 8 f 4 
4a k=2 b ; 
c Var(X) = E(X²) − (E(X))²
= ∫₀¹ 2x²(1 − x) dx − (1/3)²
= [2x³/3 − 2x⁴/4]₀¹ − 1/9 = 1/6 − 1/9 = 1/18
4 
d 5 
5 a = or 0.3125 b 0.6 or2 
6 f(x) 
& 
8 ae 
f ' 12 
I I 
! I 
3 
! ‘ ! 
13 
-1 O 1 x 


as fs 


a 


0 
Var(X) = E(X?) — (E(X))? 


= [80 + 22) dx - 08 


a(t ae 
aS.” Bla 
=3(3+3~(-3)- (4) =04 
0.538 (3 s.f.) 
k=} = 
R(T) =4 ae =4{E) =ix #2=16 
0) 


3.417 © 1.0152 d 1.01 
fla)a 
0.54 

O12 5 «= 
. © ae 

∫₀¹⁰ kx² dx = 1 ⇒ [kx³/3]₀¹⁰ = 1000k/3 = 1
⇒ k = 3/1000 = 0.003
b 7.5 c 3.75 d 0.386
fla)a 




a 1 [gti $l beagle 
= glinx - In(3 - x)? = $02 in2) = © ind 
sc = 

b E(X) = 1.5, Var(X) = 0.0860 

15-3 

Challenge 


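$$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx$$
$$= \int_{-\infty}^{\infty} (x^2 - 2\mu x + \mu^2) f(x)\,dx$$
$$= \int_{-\infty}^{\infty} x^2 f(x)\,dx - 2\mu\int_{-\infty}^{\infty} x f(x)\,dx + \mu^2\int_{-\infty}^{\infty} f(x)\,dx$$
$$= \int_{-\infty}^{\infty} x^2 f(x)\,dx - 2\mu(\mu) + \mu^2(1)$$
$$= \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2$$
$$= E(X^2) - (E(X))^2$$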
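The identity proved above can be checked exactly for a polynomial density. The sketch below uses f(x) = 2(1 − x) on [0, 1], which is assumed here to be the density behind Exercise 3C question 4; the helper functions are illustrative and represent a polynomial as a list of coefficients in increasing powers of x.

```python
from fractions import Fraction

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest power first)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += Fraction(a) * Fraction(b)
    return out

def integrate_poly(coeffs, a, b):
    """Integrate sum(coeffs[k] * x**k) exactly from a to b."""
    total = Fraction(0)
    for k, c in enumerate(coeffs):
        total += Fraction(c) * (Fraction(b) ** (k + 1) - Fraction(a) ** (k + 1)) / (k + 1)
    return total

pdf = [2, -2]                                           # f(x) = 2 - 2x on [0, 1]
mean = integrate_poly(poly_mul([0, 1], pdf), 0, 1)      # E(X)
e_x2 = integrate_poly(poly_mul([0, 0, 1], pdf), 0, 1)   # E(X^2)

deviation = [-mean, 1]                                  # the polynomial (x - mu)
var_by_definition = integrate_poly(
    poly_mul(poly_mul(deviation, deviation), pdf), 0, 1)
var_by_shortcut = e_x2 - mean ** 2
```

Using `Fraction` keeps the arithmetic exact, so the two variances can be compared with `==` rather than a tolerance.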


Exercise 3D 
1 a f(x) 


0.3 


O 4 x 


b The mode is 1. 


2 a F(x) =
  0      x < 0
  x²/16  0 ≤ x ≤ 4
  1      x > 4
b i 2.83 ii 1.26 iii 3.58
3 a Median = 1.732 since -1.732 is not in the range. 
b Q, =1.225, Q; = 2.134, IQR = 2.134 — 1.225 = 0.909 
4a_ f(x) 
il 
O 2 x 
b 0
c F(x) =
  0         x < 0
  x − x²/4  0 ≤ x ≤ 2
  1         x > 2
d Median = 2 − √2 = 0.586 (3 s.f.) as 2 + √2 is not in range.
el 


f 0.0506 (3 s.f.) 
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8 


ig 


mo 


= OO 


a 


a 


b 


f(y) 
1 
2 
i 
6 1 
I 
| 
O 3 y 
Positive skew 
0 
0 y<0O 


Median = 9=3v5 =1.15(3s.f) 


2.28 (3 sf.) 
f(x) a 
2 | 
1 
I 
I 
! 
I 
I 
O 2 x 
2 
F(x) =
  0      x < 0
  x⁴/16  0 ≤ x ≤ 2
  1      x > 2
1.68 (3 s.f.)


f(x) 


=] O 1 x 
Bimodal -1 and 1 
Median = 0 
0 x<-l 
F(x) = qutants -l<x<1 
1 SL 
f(x) 
0.6 
! 
I 
I 
I 
I 
a 
O 2 x 


Negative skew 
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10 


11 


12 


13 


os oan 


ig 


se a0 


ia) 


Mode = 1.5 
0 x<0 
=} One o8 Lae 
F(x) = 30° - 39% O<x<2 
1 GS 2 


F(1.23) = 0.495 and F(1.24) = 0.501. Since 0.5 lies 
between 0.495 and 0.501 the median lies between 
1.23 and 1.24. 


ly 1<x<3 
f(x) = {4 
0 otherwise 


Mode > median so negative skew. 


k=1.9 
12x?(1 -— x) O<xx<l1 
f(x) = 
0 otherwise 

mode = 
0.2853 

0) w<0 
Fw) = W(25 - 4w) O<w<5 

1 w>5 


F(3.4) = 0.487..., F(3.5) = 0.528..., so median lies 
between 3.4 and 3.5. 

The maximum of f(w) is at w = 2, so the mode is 2 
Median < mode so negative skew. 


1.365 (3 d.p.) 
0 x<0 
3 O<x<1 
F(x) = P 
x1 << 
as Re 
1 KS 2 


Median = 1.565 (3 d.p.) IQR = 0.821 (3 d.p.) 
Mean < median so negative skew. 


e.g. fla 
a ee Se 


x 


b eg. 


f(x)a 


O<x<1 


14 eg. f(x) = 1<x<2 


otherwise 
15 a Mode =2 
b 0 


In(% 

F(x) = (3) 
1 x>10 

c 2/5 

d Q, = 2.991, Q; = 6.687, IQR = 3.697 

277 hours 




b Q,=115 hours, Q; = 555 hours, IQR = 439 hours 


k=nr 
b 0 x<0 
F(x) = 4 tan(rx) O<x<0.25 
1 x > 0.25 


c 0.1476 


18 a le 


In6 


0 <2 
1 3x 
ae 2<x"4<4 
inal*(z0 - 75) ‘ 
1 x>4 
e 3.101 dp.) f 4 
g Negative skew: mean < median < mode 


F(x) = 


Exercise 3E 


b 0.6 

b 0.39 

b 0.6875 c p=4 
es fz 


Shaded area = 1 - 0.25 - 0.5 
= 0.25 


5 7 b 
b=11 a=3 
5 a Y~U[, 21] b 2 
6 a Continuous uniform distribution 
b E(Y) =6 c 2 d 3 
7a1 b # c 2 


b 3.066 (3 d.p.) c 0.349 (3 d.p.)


1 x>5 


a_ E(X) = 3, Var(X) =$ 
a 4.5 b 


d 0 e235 


ioe) 


b E(X) = 2, Var(X) = 8 
c 20% = 20.6 


N=) 


F(x) = = -1.75 3.5<«"<5.5 


1 x>5.5 
a=-landb=3 
E(X) = (5 + (−1))/2 = 2
Var(X) = (5 − (−1))²/12 = 3
E(Y) = 4E(X) − 6 = 8 − 6 = 2
Var(Y) = 16 Var(X) = 48
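The working above uses the standard results E(X) = (a + b)/2 and Var(X) = (b − a)²/12 for a continuous uniform distribution, followed by a linear transformation. A minimal sketch, assuming X ~ U[−1, 5] and Y = 4X − 6 as the working suggests (function names are illustrative):

```python
from fractions import Fraction

def uniform_mean_var(a, b):
    """Mean and variance of a continuous uniform distribution on [a, b]."""
    a, b = Fraction(a), Fraction(b)
    return (a + b) / 2, (b - a) ** 2 / 12

def linear_transform(mean, var, c, d):
    """Mean and variance of Y = cX + d."""
    return c * mean + d, c ** 2 * var

mean_x, var_x = uniform_mean_var(-1, 5)
mean_y, var_y = linear_transform(mean_x, var_x, 4, -6)
```

Note that the constant −6 shifts the mean but has no effect on the variance, while the multiplier 4 scales the variance by 16.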


10 


11 E&)= 


3 


12 


a 
13 a 


4a 


15 a ers 3 
otherwise 


b f(a)a 


f(x) |= 


16a 3 b Sas 


otherwise 


Mean = 0.5, Variance = 22 


12 
13 
3 


c Uniform d 
17 a 1.5 b 2 c 
d 0.48 e 0.2153 
a a=1.5,8=13.5 
b 


. -_ ss 7 
ic=5.5 ii 37 


18 


Exercise 3F 
1 E(Y) = EX’) = 254 
1 
-— n=" 5<«x<11 
0 otherwise 


c 677cm? 


wlw Re 
eo Sh 


6 
=2 a 
8 
9 


a 


as 


175 <x < 215 


== 
f(x) = {® 


0 otherwise 
f(x) 


0 175.215 me 
i0.3 ii 0 
20 d 189 e 0.1323 (4 dp.) 
a bi c 0.2276 (4 dp.) 
; b 0.2142 (4 d.p.) 
0.6 b 0.3222 (4 dp.) 


Mixed exercise 3 


10 16 26 26 5 
973 bogpo cB 
128 
343 e 0.5 
1 1 5 2 
3 b is C 35 
F(x) =
  0        x < 0
  2x − x²  0 ≤ x ≤ 1
  1        x > 1

Median = 0.293 as 1.71 is not in the range
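Medians and other percentiles in this chapter come from solving F(x) = p on the interval where the density is non-zero. When the equation is awkward to solve by hand, a bisection search does the same job; the sketch below (an illustrative helper, not a method from the book) applies it to F(x) = 2x − x² on [0, 1].

```python
def solve_cdf(cdf, p, lo, hi, tol=1e-10):
    """Find x in [lo, hi] with cdf(x) = p, assuming cdf is increasing there."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid      # the root lies to the right of mid
        else:
            hi = mid      # the root lies at or to the left of mid
    return (lo + hi) / 2

median = solve_cdf(lambda x: 2 * x - x * x, 0.5, 0.0, 1.0)
print(round(median, 3))
```

Restricting the search to [0, 1] plays the same role as rejecting the root 1.71 in the written answer.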


a F(2) = 1; F(y) = k(y² − y)

k(4 − 2) = 1 ⇒ k = 1/2

b 0.375
c Median = 1.62 as −0.618 is not in the range
d f(y) =
  y − 1/2  1 ≤ y ≤ 2
  0        otherwise
a 0.648 
b Median = 2.55 as -2.55 is not in the range 
c f(x) =
  2x/5  2 ≤ x ≤ 3
  0     otherwise
aq # e Mode =3 
2 2 
a ∫₀² kx² dx = 1 ⇒ [kx³/3]₀² = 1
8k/3 = 1 ⇒ k = 3/8
b 1.5
c F(x) =
  0     x < 0
  x³/8  0 ≤ x ≤ 2
  1     x > 2
d 1.59 (3 s.f.) e Mode = 2
a ∫₁³ k(y² + 2y + 2) dy = 1
[k(y³/3 + y² + 2y)]₁³ = 1
k(9 + 9 + 6) − k(1/3 + 1 + 2) = 1
(62/3)k = 1 ⇒ k = 3/62
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10 
11 


b F(y) =
  0                              y < 1
  y³/62 + 3y²/62 + 3y/31 − 5/31  1 ≤ y ≤ 3
  1                              y > 3
11, 
C Br 
a f(x) 
3 
8 
2 O 2 
Mode = 0 
c 0 x<-2 
-{i2e “@ 1 2,<x<2 
FA)=) 35-32 +5 
1 x>2 
35 
jog OF 0.273 
26 
ao 
b 0 x<0 
3 O<x<1 
F(x) = 
207 5 24 
ap | 
A. x>2 
c i 1.401 ii 0.45 


d Mean < median, negative skew 


F(1) = 0 ⇒ 0.05a − b = 0
F(2) = 1 ⇒ 0.05a² − b = 1
0.05(a² − a) = 1 ⇒ a² − a − 20 = 0
(a + 4)(a − 5) = 0

Given that a is positive, a = 5 and b = 0.05a = 0.25


F(x) < 0 for8<x< 10 
3 
a ∫₁³ (kx − k) dx = 1
[kx²/2 − kx]₁³ = 1
(9k/2 − 3k) − (k/2 − k) = 1
2k = 1 ⇒ k = 1/2
b 7/3
c F(x) =
  0                 x < 1
  x²/4 − x/2 + 1/4  1 ≤ x ≤ 3
  1                 x > 3
d F(2.4) = 2.4²/4 − 2.4/2 + 1/4 = 0.49
F(2.5) = 2.5²/4 − 2.5/2 + 1/4 = 0.5625
Since 0.5 lies in between, the median is between 2.4 and 2.5
e Mean < median, negative skew
f 1.265 (4 s.f.)


1 


4 




12 a f(x)a 
i) 
+ 
(3 
I 
) 
I 
I 
I 
I 
I 
I 
I 
t {$$ 
1 2 ss 
b Mode =1 ce Ul d 1.14 
€ 0 x <0 
2 
5 O<4<1 
F(a) = | 
3 
o+3 1<x<2 
1 x>2 
f Median = 1 
13 a f(xa 
1] 
2 


1 
1 
1 
I 
I 
I 
1 
1 
1 
I 
I 
I 
t 
2 


0 5 * 
b Mode = 2 
c Using the sketch, P(X > 2) = area of triangle
= ½ × 3 × ½ = 0.75
d (0) x<0O 
xt O<sx<2 
F(x) = ee 
10x - x? -13 2<x4<5 
12 
1 x>5 
e 5 − √6 = 2.55 (3 s.f.)
Li_¢ye 
ade ar 6x? + 30x) 2<x<5 
0 otherwise 
b Mode = 2.5 
c f(x)a 
4] 
9 
+} 
0 
ad 2 
e Fun=#(22) =| 2(22)" + 15(22)'— a4 
we BIL Oe 6 


= 0.5297 (4 d.p.) > 0.5


f F(2.5) = 0.2284 so as 0.2284 < 0.5 < 0.5297, for 
this distribution mode < median < mean, which is 
positive skew 

F(5)=1> K35x5-2x«5%4=15125k=1 


15 a 


aA. 
= k=a5 


b 2.02 (3 s.f.) 


35 — 4x 
c f(x)= 125 


0) otherwise 


O<x5 


d f@a 
35_ 
125 


Positive skew as mean > median > mode 


Rao mo 


_~3 7_ 5 
16 a=3,6=5 


[Pee + D8 dx =1 easy a -1 
= 4 1 


=lsk=4 


17 a 


=> 


| ar 


b -0.2 
x<-1 


0 
ra) {ia -1<x<0 
1 60) 
-0.159 (3 s.f.) 


fya 
1 


7 


ir] 


a 


18 a 


aq 
Re 
a = 


{ro ore 
dela 


19 -1.5 


oe 
re 
M 
es 
Oe oly 


20 
0.533 (3 s.f.) 

Continuous uniform distribution Y ~ U[O, 10] 
3 


3 
Os 


21 


se oe 


10 
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22 


23 
24 


25 
26 


27 


28 


29 


30 


31 


32 


a 


=") 


F(x) = 


an fs & & 


Continuous uniform distribution X ~ U[O, 20] 


f(x) 


os 
20 


0 20 * 
E(X)=10 Vary) = 408 c 
X~ U[-0.5, 0.5] b 04 ec 4 


a -3<x<10 
fx) = {* 
0 otherwise 


3.5 minutes 


0) x<-3 
+ 


Fa@)= {243 -3<x<10 


13 
1 £>10 


1 
X ~ U[-0.5, 0.5] 


i 
i= | 20 


0) otherwise 


b 0.4 c 0.064 


190 sx < 210 


i 2 ii O ec 10 d 
Continuous uniform distribution 
Normal distribution 
0 
_ 6-0 
216 
1 


1.24 hours (3 s.f.) 


wr 


t<0O 
0<t<6 
t>6 


F@=41 


c 1.5 hours 
1 p<x< 
ta) =| b<x<5b 
0 otherwise 
3b 


2 5b 
po) = [= ipac | 1 pease 


_ 1246? _ 310? 
i2 3 


12 
0.246 (3 s.f.) 


0 Fae | 
In(2x - 1) 
In5 

1 G3. 
k=n 
0.5947 = 59.47% 


k=2 
3 
142In2 = 0.795 (3 s.£) 


0.256 (3 s.f.) 


Challenge 
1a 6~U(0, 27] 


b 


21 — 0.6366r (4 dp.) 
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c 


Spin the spinner 100 times and measure X each 
time. Take the mean of these observations and 
divide 2r by this value. 


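2 a
$$E(X) = \int_0^\infty x f(x)\,dx = \int_0^\infty x\lambda e^{-\lambda x}\,dx$$
$$= \left[x(-e^{-\lambda x})\right]_0^\infty - \int_0^\infty -e^{-\lambda x}\,dx$$
$$= 0 + \int_0^\infty e^{-\lambda x}\,dx = -\frac{1}{\lambda}(0 - 1) = \frac{1}{\lambda}$$
$$E(X^2) = \int_0^\infty x^2 f(x)\,dx = \int_0^\infty x^2\lambda e^{-\lambda x}\,dx$$
$$= \left[x^2(-e^{-\lambda x})\right]_0^\infty - \int_0^\infty -2xe^{-\lambda x}\,dx$$
$$= 0 + 2\int_0^\infty xe^{-\lambda x}\,dx = \frac{2}{\lambda^2}$$
$$\mathrm{Var}(X) = E(X^2) - (E(X))^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$$
b
$$P(X > a) = 1 - P(X < a) = 1 - \int_0^a \lambda e^{-\lambda x}\,dx$$
$$= 1 - \left[-e^{-\lambda x}\right]_0^a = 1 - (-e^{-\lambda a} + 1) = e^{-\lambda a}$$
Similarly, $P(X > b) = e^{-\lambda b}$ and $P(X > a + b) = e^{-\lambda(a+b)} = e^{-\lambda a} \times e^{-\lambda b}$
$$P(X > a + b \mid X > a) = \frac{P(X > a + b)}{P(X > a)} = \frac{e^{-\lambda a} \times e^{-\lambda b}}{e^{-\lambda a}} = e^{-\lambda b} = P(X > b)$$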
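The result in part b is the memoryless property of the exponential distribution. A quick numerical illustration, with hypothetical values of λ, a and b chosen only for the example:

```python
from math import exp, isclose

def survival(x: float, lam: float) -> float:
    """P(X > x) for an exponential distribution with rate lam."""
    return exp(-lam * x)

lam, a, b = 0.5, 2.0, 3.0     # hypothetical values for illustration

# P(X > a + b | X > a) = P(X > a + b) / P(X > a)
conditional = survival(a + b, lam) / survival(a, lam)
unconditional = survival(b, lam)
```

The two probabilities agree for any positive λ, a and b, so having already waited a units does not change the chance of waiting a further b.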
CHAPTER 4 
Prior knowledge check 
1a 0.8944 b 0.2902 
2 a 0.6745 b 1.0364 
3a 14 b 27 c -41 


Exercise 4A 


1a W~N(130, 13) 


2 
3 


15 


16 


b W~N(30, 13) 


R~ N(148, 18) 


a T ~ N(180, 225) or N(180, 15²)

b T ~ N(350, 784) or N(350, 28²)

c T ~ N(530, 1009)

d T ~ N(−40, 89)

a A ~ N(35, 9) or N(35, 3²) b A ~ N(7, 6)
c A ~ N(41, 41) d A ~ N(84, 82)
e A ~ N(19, 15)

a 0.909 b 0.0512 ec 0.319 
d 0.0614 e 0.857 f 0.855 
a 10 b 9 c 0.136 
a 0.7881 b 0.2119 c 0.3400 d 0.2882 
a 64 b 148 c 0.0293 d 28 
a 0.4497 b 0.0495 

0.0385 

0.0537 

a i 0.268 ii 0.436 

b 37.1cm (3 s.f.) 

a 0.3768 b 0.6226 ce 0.9059 

0.732 

a i 60 ii 25 

b i R ~ N(50, 20) ii 0.9320

a 0.8644 

b All random variables were independent. This is reasonable as games were chosen at random, and game size and hard drive size are unconnected.




17 0.9044 


18 a 


0.7390 b 0.4018 


Challenge 
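$$\mathrm{Var}(X + Y) = E((X + Y)^2) - (E(X + Y))^2$$
$$= E(X^2 + 2XY + Y^2) - (E(X) + E(Y))^2$$
$$= E(X^2) + 2E(X)E(Y) + E(Y^2) - (E(X))^2 - 2E(X)E(Y) - (E(Y))^2$$
(using $E(XY) = E(X)E(Y)$, which holds because $X$ and $Y$ are independent)
$$= E(X^2) - (E(X))^2 + E(Y^2) - (E(Y))^2$$
$$= \mathrm{Var}(X) + \mathrm{Var}(Y)$$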
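The proof above can be illustrated by exact enumeration with two independent discrete variables. The distributions below (a fair die and a hypothetical two-valued variable) are made up for the example; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(Fraction(v) * p for v, p in dist.items())

def variance(dist):
    m = mean(dist)
    return sum((Fraction(v) - m) ** 2 * p for v, p in dist.items())

die = {k: Fraction(1, 6) for k in range(1, 7)}   # fair six-sided die
coin = {0: Fraction(1, 2), 3: Fraction(1, 2)}    # hypothetical: scores 0 or 3

# Distribution of X + Y: independence means joint probabilities multiply.
sum_dist = {}
for (x, px), (y, py) in product(die.items(), coin.items()):
    sum_dist[x + y] = sum_dist.get(x + y, Fraction(0)) + px * py

var_sum = variance(sum_dist)
var_parts = variance(die) + variance(coin)
```

The multiplication `px * py` is exactly where independence is used; with dependent variables the joint probabilities would differ and the variances would not simply add.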


Review exercise 1 


la 
b 


c 
2a 


eS oaAn o 


3 a S_xy = 71.4685, S_xx = 1760.459
b y = 0.324 + 0.0406x (3 s.f.)
c 2461.95 mm (2 d.p.)
d l = 2460.324 + 0.0406t
e 2463.98 mm (2 d.p.)
f This estimate is unreliable, since it is outside the range of the data.
4 a & d [Scatter diagram: horizontal axis x (% cocoa), 0 to 80; vertical axis 0 to 140]
b S,,= xy - aay = 28750 - Hb 20 = 4337.5 


Cc 
e 


y = -0.425 + 0.395x (3 s.f.) 
f= 0.735 + 0.395m (3 s.f.) 


93.6 litres (3 s.f.) 
[Scatter diagram: horizontal axis Time (x weeks), 0 to 20; vertical axis 0 to 120]

The points lie close to a straight line. 
a= 29.02, b = 3.90 
3.90 ml of the chemicals evaporate each week. 
i 103ml ii 166ml 
i This estimate is reasonably reliable, since it is 


just outside the range of the data. 
ii This estimate is unreliable, since it is far outside 
the range of the data. 


Sy = 2821.875 

a=17.0,6=1.54 

i Brand D is overpriced, since it is a long way 
above the regression line. 

ii 69p or 70p since this is the predicted price for a 
bar of chocolate with 35% cocoa.


10 


11 


12 


13 


iw 


seSongoamns aa 


as & 


anon 


1.749, -0.806, -1.05, -1.672, 0.5505, 1.084, 

-1.471, 1.7515 

A linear model is a suitable model as the residuals 

are randomly scattered about zero. 

14.3 (3 s.f.) 

The first sample as the RSS is smaller. 

Simm = 37.9, Sma = 24.0 (3 s.f.) 

d = -2.48 + 0.633m (3 s.f.) 

7.33 (3 s.f) 

3.75 (3 s.f.) 

-0.50765 

Not suitable since the residuals are not randomly 

scattered about zero. 

y = 41.9 + 264« (3 sf.) 
895.571 42 

RSS = 289771.4 x 3.388571 

-41.5, 2.1, -41.9, 10.5, -55.1, -72.7, 202.9 

1.8 

e.g. It could be a legitimate data point (a company 

that thrives despite (relatively low) spend on 

advertising). 

y =-22.9 + 279x (3 s.f.) 

£479 000 (3 s.f.) 

This estimate is reliable as it is within the range of 

the data. 

h = 80.6 + 1.921 (3 s.f.) 

167 cm (3 s.f.) 

2.2071 

The residuals are randomly scattered about zero so 

the model is suitable. 

30.8 (3 s.f.) 

The female sample since the RSS is lower. 


= 53100 (3 s.f.) 


Diagram A corresponds to —0.79, since there is 
negative correlation. 

Diagram B corresponds to 0.08, since there is very 
weak or no correlation. 

Diagram C corresponds to 0.68, since there is positive 
correlation. 


ee) 
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-0.816 

Houses are cheaper the further away they are from 
the railway station. 

-0.816 

ale 

Si = 983.6, Sinm = 1728.9, Sim = 1191.8 

0.914 (3 s.f.) 

0.914. Linear coding does not affect the correlation 
coefficient. 

0.914 suggests a relationship between the time 
spent shopping and the money spent. 

0.178 suggests that there was no such relationship. 
e.g. Shopping behaviours may be different on 
different days of the week. 

13/21 or 0.619 (3 s.f.)

H₀: ρ = 0, H₁: ρ > 0

5% critical value = 0.6429

0.619 < 0.6429 so do not reject H₀. The evidence does not show positive correlation between the judges' marks, so the competitor's claim is justified.


r_s = −11/15 or −0.733 (3 s.f.)


H₀: ρ = 0; H₁: ρ < 0
5% critical value = −0.5636.


14 


15 


16 


17 


a 


a 


−0.733 < −0.5636 so reject H₀. There is evidence of a significant negative correlation between the price of an ice lolly and the distance from the pier.

The further from the pier you travel, the less money 
you are likely to pay for an ice lolly. 

The variables cannot be assumed to be normally 
distributed. 

r_s = 5/7 or 0.714 (3 s.f.)

H₀: ρ = 0; H₁: ρ > 0.

5% critical value = 0.8286.

0.714 < 0.8286 so do not reject H₀. There is no evidence that the relative vulnerabilities of the different age groups are similar for the two diseases.


i r_s = 23/28 or 0.821 (3 s.f.)

ii H₀: ρ = 0; H₁: ρ > 0.
5% critical value = 0.7143.
0.821 > 0.7143 so reject H₀. There is evidence of a (positive) correlation between the ranks awarded by the judges.


14 


12 


10 


0 1 2 3 4 5 6 
The strength of the linear link between two 
variables. 

Su= 26.589; Sp)= 152.444; S,= 59.524 

0.93494... 

H₀: ρ = 0, H₁: ρ > 0.

5% critical value = 0.7155.

0.935 > 0.7155 so reject H₀; reactant and product are positively correlated.

Linear correlation is significant but the scatter 
diagram looks non-linear. The product moment 
correlation coefficient should not be used here since 
the association/relationship is not linear. 

H₀: ρ = 0, H₁: ρ < 0.

5% critical value = −0.8929

−0.93 < −0.8929 so reject the null hypothesis. There is evidence supporting the geographer's claim.
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18 


19 


20 


21 


22 


eo & 


conf & 


i No effect since rank stays the same. 

ii It will increase since d = 0 and n is bigger. 
The mean of the tied ranks is given to each and 
then the PMCC is used. 


2 ae 
[ k(4x-x8)dx=15 [2ex# eal =1 


0 
=4k=1>k=4 


f(x) 


Av3 
9 


O 


ES 
hea 
i) 
R 


E(X) = 48 (or 1.07 to 3 s.f.) 


Mode = 2u3 (or 1.15 to 3 s.f.) 


Median = 1.08 (3 s.f.) 
Mean (1.07) < Median (1.08) < Mode (1.15) 
so negative skew. 


[kate -2)de=1 > k[tx8—29) 21 


K(9-9)-($-4))=1 3 k=3 


67 
1280 


0 x<2 
Hae -3x2+4) 2<x<3 

1 S38 
F(2.70) = 0.453 and F(2.75) = 0.527 


F(x) = 


0.453 < 0.5 < 0.527 so the median lies between 


2.70 and 2.75. 


F’(y) < 0 for 1.625 < y < 2 so his model cannot be 


a cumulative distribution function. 

k(24 + 22-2)-k(1+1-2)=1 

=> k16+4-2)=15 18k=1Sk=4 
18k=lsk=3 


203 
ag OF 9.705 (3 s.f.) 


0 


1<y<2 


otherwise 


f(x) 


0 2 3 x 


Mode of X is 3. 

1 

18 

Median of X is 2.71. 

Mean (2.67) < median (2.71) < mode (3) so 
negative skew. 


f(x) 


0.5-+-----~--5 


23 


cee f 


lo] 


=> oc & 


24a 


25 a 


a0 ft 


26a 


b 
c 


Mode of X is 3. 


iy x<0O 
aX" 0<1<3 
F(x) = 12 S 
2x-4t42?-3 3<x<4 
i x>4 
dl 
Median = V6 = 2.45 (3 sf.) 
2.272 (3 dp.) 
0.847 
F(0.59) = 0.491 and F(0.60) = 0.504

0.491 < 0.5 < 0.504 so the median lies between 0.59 and 0.60.
f(x) =
  4x − 3x²  0 ≤ x ≤ 1
  0         otherwise
E(X) = 7/12 or 0.583 (3 s.f.)
Mode = 2/3 or 0.667 (3 s.f.)
Mean (0.583) < median (0.59 − 0.6) < mode (0.667) so negative skew.
[eax + [Fax =1 
bo a BON 


= [kx]? + tk nal} = 1 
>k(24+1n2)=1 


4 
24+ 1n2 


f(x)A 
1 


6 


(= 1.485...) 


-1 O 5 2 
E(X) = 2 
Var(X) = 3 
0.6 
f(x) =
  1/4  2 ≤ x ≤ 6
  0    otherwise
E(X) = 4
Var(X) = 4/3
F(x) =
  0          x < 2
  (x − 2)/4  2 ≤ x ≤ 6
  1          x > 6

0.275 
Continuous uniform distribution 
f(x) 

1. 

5 

(6) 5 2 
E(X) = 2.5; Var(X) = 3 
5 d 0


1 
a spon COPS 
0 otherwise 
b a=-2 B=6 
29 a 75cm b 43.3(3sf) oc O=2 
30 a 0.7586 
b The durations of the two rides are independent. 
This is likely to be the case as two separate control 
panels operate each of the rides. 
c D~N(246, 27) 
d 0.5713 
31 a 30 b 4.84 c 0.5764 
32 a 0.7377 b 0.6858 
33 a 0.9031 b 0.8811 
34 a 0.1336 b 0.8413 c 0.1610 
d Allrandom variables are independent and normally 
distributed. 
Challenge 
1 a i Linear model: y = -2.63 + 2.285x 
ii Quadratic model: y = 1.04 + 0.1206x + 0.2353x? 
iii Exponential model: y = 1.1762e°*45 
b_ Linear residuals: 1.845, -0.925, -1.21, -1.295, 


0.435, 1.15 

Quadratic residuals: 0.1041, -0.2195, 0.0128, 
—-0.0255, 0.3861, -0.264 

Exponential residuals: —0.16644, -0.04507, 
0.560703, 0.785369, 0.321593, -2.29619 

Hence quadratic model is most suitable as the 
residuals are smaller and are randomly scattered 
around zero. 


2 af ke-dx =klLele =k => k=1 


b 
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_{0 x<0 
Fad= {Pgs x=0 
- a ee -1 
e!-e* or ; 

A 
F(x) 
tt gic oe Seay 


> 
y 1 a 


Fy(y) = P(X, sy) 9... 1 (%, Sy) 
=F,Y... F,W=y" 
So £,Y) = ny"! 
1 1 
10 = (eae =n na 2 
V0.5 


z O<zsl 
fy=);2-z 1<z<2 
yA 0 otherwise 


O 1 2 x 


$2 O<z<l1 
3 3)? 
d £(2 = 3- (2-3) 1<z<=2 
giz-3P 2<z<3 
0) otherwise 
CHAPTER 5 
Prior knowledge check 
1a 0.391 b N(25, 45) 
2 Reject Hy (p-value = 0.00143) 
3 0.1870 
Exercise 5A 
1 i a N(10p, 1002) b Niu, 320%) 
2 =| 
c N(O0, 100?) d N(x, 10 
e N(O, 100?) f N(O, 10) 
ii a,b, d,e, are statistics since they do not contain py 
or o the unknown population parameters. 
2a y= E(x) = % or 4.4 
o? = 11.04 or 22 
b (1,1) (1, 5p? 4, 10° 
(5, 5} {5, 10}? 
{10, 10} 
© liz 1 | 3 | 5 |55] 7.5] 10 
y¥_~| 4 8 4 4 4 1 
P(X =%)| 35 | a5 | a5 | 25 | 25 | 35 
e.g. PX = 5.5) = 5 
d E(X)=1x5+3x$+...4+10xp=44=y 
Var(X) = 1? x £+3?xS4...+10?x $- 4.4? 
2 
=5.52-2 
5.5 9 
3 a *¥=19.3, 5? = 3.98 b % = 3.375, s? = 4.65 
© X= 223,s?=7174 d % = 0.5833, s? = 0.0269 
4a 36.4, 29.2 (3s.f.) b 9,4 
ce 1.1, 0.0225 d 11.2, 2.24 (3s.f.) 
5 a An estimator of a population parameter that will 'on average' give the correct value.
b x̄ = 236, s² = 7.58
6 x̄ = 205 (3 s.f.), s² = 9.22 (3 s.f.)
Ta p= 2. = 2 
b (5,5, 5} 
{5, 5, 10}? {5, 10, 10}° 
{10, 10, 10} 
ec fa 20 | 25 
PX =%) 7 7 m7 7 
d E(X)= 2 = 1, Var() = 50.2 


27. 3 
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10 


11 


12 
13 


14 


© lm 5 | 10 
P(M=m)| 2/4 
f EW =6.296... 
Var(M) = 4.80... 
g Bias = 1.30 (3s.f.) 
a 10p
b X̄ = (X₁ + ... + X₂₅)/25
E(X̄) = (E(X₁) + ... + E(X₂₅))/25 = (25 × 10p)/25 = 10p
∴ X̄ is a biased estimator of p
so bias = 10p − p = 9p
c X̄/10 is an unbiased estimator of p.
a E(X) = 0
E(X²) = a²/3
b Y = X₁² + X₂² + X₃²
E(Y) = E(X₁²) + E(X₂²) + E(X₃²) = (a²/3) × 3 = a²
Y is an unbiased estimator of a²
a gy =16.2,s, = 12.0 (3s.f) 
b w=15.92,s,? = 10.34 
c Standard error is a measure of the statistical 
accuracy of an estimate. 
Ss 
d —+ = 0,632 (3sf), —£ = 0.633 (3s), 
V20 V30 
Ss 
“= 0.455 (3s.f.) 
V50 
e Prefer to use w since it is based on a larger sample 
size and has smallest standard error. 
a *=65,s,? = 9.74 (3s.f.) 
b_ need a sample of 37 or more 
c No. Because the recommendation is based on 
the assumed value of s? from the original sample 
OR The value of s? for the new sample might be 
different /larger. 
d 65.6 (3.s.f.) 
Need a sample of 28 (or more) 
a 4.89 
b 0.0924 (3s.f.) 
c Need n = 35 (or more) 
a_ E(X,) = np, E(X,) = 2np, Var(X,) = np (1 - p), 
Var(X,) = 2np (1 — p) 
X, 
b Prefer oe since based on larger sample (and 
therefore will have smaller variance) 
2%, #2 
Oe 4 n* on 
1/E(%) es) 4 ("” 20? 
Bo = { nm" 2n)~ 2\n* on 
1 
=51P + P)=P 
X is an unbiased estimator of p 
d Y= (“ + *) 
3n 


E(X{) + EX) np + 2np _ 
3n 3n e 
Y is an unbiased estimator of p 


E(Y) = 




e Var(Y) is smallest so Y is the best estimator. 
¢ 2 
15 a p=1,07=0.8 or 4 
b {0,0,0} {0,0, 1? {0, 0, 23 
{1,;.1; 1} (1,:1,,0)% (1,.1, 2)" 
{2,2, 2}. {2,.2,0%8 {2,2,1}8 {0,1, 2}3!=6 
ec fo 1 2 4 5 
G 0 = 7 1 3 3 2 
Ta 8 i | 30 | 25 | 30 | 12 8 
P(X =%) | aos | a5 | ios | tos | tos | i258 | io 
d E(X¥)=1(€yp) 
yy. 4_ o 
varl®) = = (3) 
e jn 0 1 2 
44 | 37 | 44 
PIV =n) | 355 | i235 | i25 
f EW=1 
Var(N) = 0+ 1? x 224 2?x 44-1? = 38 (= 0) 
g EW)=1=p 
h_ X because Var(X) is smaller. 
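Challenge
1 a
$$s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2 = \frac{S_{xx}}{n-1} = \frac{1}{n-1}\left(\sum x^2 - \frac{\left(\sum x\right)^2}{n}\right) = \frac{1}{n-1}\left(\sum x^2 - n\bar{x}^2\right)$$
b
$$\sigma^2 = \mathrm{Var}(X) = E(X^2) - \mu^2 \;\Rightarrow\; E(X^2) = \sigma^2 + \mu^2 \quad (1)$$
$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} \text{ and } E(\bar{X}) = \mu$$
$$\frac{\sigma^2}{n} = E(\bar{X}^2) - \mu^2 \;\Rightarrow\; E(\bar{X}^2) = \frac{\sigma^2}{n} + \mu^2 \quad (2)$$
$$S^2 = \frac{1}{n-1}\left(\sum X^2 - n\bar{X}^2\right)$$
$$E(S^2) = \frac{1}{n-1}E\left(\sum X^2 - n\bar{X}^2\right) = \frac{1}{n-1}\left(E\left(\sum X^2\right) - nE(\bar{X}^2)\right)$$
$$E\left(\sum X^2\right) = \sum E(X^2) = nE(X^2)$$
$$E(S^2) = \frac{1}{n-1}\left(nE(X^2) - nE(\bar{X}^2)\right)$$
$$= \frac{1}{n-1}\left(n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right)\right) \text{ by (1) and (2)}$$
$$= \frac{1}{n-1}\left(n\sigma^2 - \sigma^2\right) = \sigma^2$$
So the statistic S² is an unbiased estimator of the population variance σ², and s² is an unbiased estimate for σ².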
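The unbiasedness of S² can be confirmed by listing every possible sample from a small population. The population values and sample size below are hypothetical, chosen only to keep the enumeration short; sampling is with replacement so that the observations are independent.

```python
from fractions import Fraction
from itertools import product

population = [1, 5, 10]   # hypothetical equally likely values
n = 3                     # sample size

mu = Fraction(sum(population), len(population))
sigma2 = sum((Fraction(v) - mu) ** 2 for v in population) / len(population)

def sample_variance(sample, divisor):
    m = Fraction(sum(sample), len(sample))
    return sum((Fraction(x) - m) ** 2 for x in sample) / divisor

# Every ordered sample of size n is equally likely.
samples = list(product(population, repeat=n))
expected_s2 = sum(sample_variance(s, n - 1) for s in samples) / len(samples)
expected_biased = sum(sample_variance(s, n) for s in samples) / len(samples)
```

Dividing by n − 1 gives an expected value equal to σ², whereas dividing by n gives (n − 1)σ²/n, which underestimates the population variance.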


Exercise 5B 


uF 
2 
3 


a 
a 
a 


b 


i") 


a 


(124, 132) 
(83.7, 86.3) 


b (123, 133) 
b (83.4, 86.6) 


Niall must assume that the underlying population is 
normally distributed. 


(20.6, 25.4) 
n= 609 


b n=865 c n=1493 


A 95% confidence interval for a population parameter θ is one where there is a probability of 0.95 that the interval found contains θ.


(1.76, 2.04)


(304, 316) (3 s.f.) b 0.0729 

a Yes, Amy is correct. By the central limit theorem, for 
a large sample size, underlying population does not 
need to be normally distributed. 

b (73113, 78 631) (Mearest integer) 

or (73100, 78 600) (3 s.f.) 

Must assume that these students form a random 

sample or that they are representative of the 

population. 

b (66.7, 70.1) 

c μ = 65.3 is outside the C.I., so the examiner's sample was not representative. The examiner marked more better-than-average candidates.


2 
R= Vertes So = 200, 108 


12° 12~«O3 
b (77.7, 79.7) (3s-£) 


10 a (23.2, 26.8) is 95% C.I. since it is the narrower 
interval. 
b 0.918 ce 25 
11 a (130,140) b 85% c Need n= 189 or more 


12 (30.4, 32.4) (3s.f.) 
13 (258, 274) (3s.f.) 
14 a 0.311 


b 0.866 ec (21.8, 23.3) (3s.f.) 


Exercise 5C 


1 Ho: wy = he. Hy: a, > , 5% cv. is z= 1.6449 
pepe me slole! 24.9690. 
ae 
15 20 


1.3699 < 1.6449 so result is not significant, accept Hp. 
2 H₀: μ₁ = μ₂, H₁: μ₁ ≠ μ₂. 5% c.v. is z = ±1.96
z = ((51.7 − 49.6) − 0)/√(4.2²/30 + 3.6²/25)
[Choose x̄₂ − x̄₁ to get z > 0]
t.s. z = 1.996... > 1.96 so result is significant. Reject H₀.


3 H₀: μ₁ = μ₂, H₁: μ₁ < μ₂. 1% c.v. is z = −2.3263
z = ((x̄₁ − x̄₂) − 0)/√(0.81²/25 + 0.75²/36)

t.s. = −2.3946... < −2.3263 so result is significant.
Reject H₀.


4 H₀: μ₁ = μ₂, H₁: μ₁ ≠ μ₂. 1% c.v. is z = ±2.5758
tgag a MIZUHO _ 5719... + 25758 
8.2? 11.3? 
85 100 


significant result so reject Hp. 

Central limit theorem applies since n,, n, are large and 
enables you to assume X, and X, are both normally 
distributed. 


5 Ho: fr = M2 Hy: yy > pp 5% CV. is Z= 1.96 
ig ya EO ee aa65. 2 1.8 
18.3? 15.4? 
100 150 


Result is not significant so accept Hp. 
Central limit theorem applies since n,, n, are both large 
and enables you to assume X, and X, are normally 
distributed. 
6 H₀: μ₁ = μ₂, H₁: μ₁ < μ₂. 1% c.v. is z = −2.3263
z = ((0.863 − 0.868) − 0)/√(0.013²/120 + 0.015²/90)
= −2.5291... < −2.3263

Result is significant so reject H₀.
Central limit theorem is used to assume X̄₁ and X̄₂ are normally distributed since both samples are large.
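The test statistics in this exercise all have the same form: the difference in sample means, minus the hypothesised difference, divided by the standard error. A minimal sketch with illustrative names, checked against the figures quoted in questions 2 and 6 (the assignment of each mean and standard deviation to a sample in question 2 is an assumption):

```python
from math import sqrt

def two_sample_z(mean1, mean2, sd1, sd2, n1, n2, diff0=0.0):
    """z statistic for the difference of two means with known (or large-sample)
    standard deviations; diff0 is the difference under H0."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return ((mean1 - mean2) - diff0) / standard_error

z_q2 = two_sample_z(51.7, 49.6, 4.2, 3.6, 30, 25)        # assumed pairing
z_q6 = two_sample_z(0.863, 0.868, 0.013, 0.015, 120, 90)
print(round(z_q2, 3), round(z_q6, 4))
```

The `diff0` argument covers tests such as Exercise 5C question 8, where the null hypothesis specifies a non-zero difference between the means.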
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7 Not significant. There is insufficient evidence to suggest 
that the machines are producing pipes of different lengths. 


8 a Ho: Lnew — Hola = 1; Hy: Hnew — Mola > 1 


b 


Test statistic = 3.668... > c.v. (1.6449), therefore 
evidence that yes, the mean yield is more than 

1 tonne greater. 

Mean yield is normally distributed; Sample size is 
large. 

Ho: grain — Lgrain/grass = 0; Hy: grain # Lgrain/grass 

Test statistic = 2.376... > c.v. (1.96), therefore 
evidence that there is a difference between the mean 
fat content of the milk of cows fed on these two diets. 
Mean fat content is normally distributed. 


Challenge 


Ny Ny 
Lai ~ Lyi AE + yf 


Ny, + Ny 


yon 


b (45.76, 47.33) 


Exercise 5D 

1 2.34 > 1.6449 so the result is significant. 
There is evidence that Quickdry dries faster than 
Speedicover. 


a 


c 


ts. 


Not significant. There is insufficient evidence to 
confirm that mean expenditure in the week is more 
than at weekends. 

We have assumed that s₁ = σ₁ and s₂ = σ₂.

Not significant. Insufficient evidence to support a 
change in mean mass. 

We have assumed that s = o since n is large. 

Not significant so accept Hp. 

ts. = 1.6535... > 1.6449 

Significant so reject Hy. 

We have assumed that s₁ = σ₁ and s₂ = σ₂ since the samples are both large.

= -1.944... < -1.6449 


Significant. There is evidence that the weights of 
chocolate bars are less than the stated value. 
6 a Result is significant. There is evidence of difference 


in mean age of first-time mothers between these 
two dates. 

There is no need to have to assume that both 
populations were normally distributed since both 
samples were large so the central limit theorem 
allows you to assume both sample means are 
normally distributed. 

We have assumed that s₁ = σ₁ and s₂ = σ₂.


Mixed exercise 5 
1 a H₀: μ = 0.48, H₁: μ ≠ 0.48; Significance level = 10%;


2 


b 
a 


0.48 is in confidence interval so accept Hp. 
(0.4482, 0.5158) 
i and iii are since they only contain known data; 
ii is not since it contains unknown population 
parameters. 

_ 170? 
a9 
t.s. = 2.645 > 1.6449
Result is significant so reject Hp. 
There is evidence that the new bands are better. 
(46.7, 47.6) (3s.f.) 
4.53, 0.185 (3d.p.) 
0.3520 


b (4.10, 4.96) (3 s.f.)


5 14.01, 0.04 (2 d.p.)
6 The smallest value of n is 14.
7 a (41.1, 47.3) (3 s.f.)
b t.s. = −1.375 > −1.6449
Not significant so accept H₀. There is insufficient evidence to support the headteachers' claim.
8 a X̄ ~ N(μ, σ²/n)
b Exact because X is normally distributed.
c Need n = 28 or more
9 a 0.75, 4.84
b i Assume that X has a normal distribution
ii Assume that the sample was random
c (−0.346, 1.85) (3 s.f.)
d Since 0 is in the interval it is reasonable to assume that trains do arrive on time.
10 a X̄ ~ N(μ, σ²/n)
b A 95% C.I. is an interval within which we are 95% confident μ lies.
c (3.37, 14.1) (3 s.f.)
d (9.07, 10.35) (nearest penny)
e t.s. = 1.8769... > 1.6449
Significant so reject H₀. There is evidence that the mean sales of unleaded petrol in 1990 were greater than in 1989.
f n = 163
11 a A 98% C.I. is an interval within which we are 98% sure the population mean will lie.
b 0.182 (3 d.p.)
c (9.84, 10.56) (2 d.p.)
d n = 13
12 a E(X̄) = μ, Var(X̄) = σ²/n
b i X̄ ~ N(μ, σ²/n) ii X̄ ≈ N(μ, σ²/n)
c (17.7, 19.3) (3 s.f.)
d n = 278 or more
e t.s. = −2.1176... < −1.96
Significant so reject H₀. There is evidence that the mean of till receipts in 2014 is different from the mean value in 2013.
13 a Approx 83 samples will have mean < 0.823
b (0.823, 0.845) (3 s.f.)
c Since 0.824 is in the C.I. we can conclude that there is insufficient evidence of a malfunction.
14 a H₀: μ₁ = μ₂; H₁: μ₁ > μ₂. The test statistic is 2.6512, which is greater than the critical value (1.6449), so reject H₀. The cardiologist's claim is supported.
b Assume normal distribution or assume sample sizes large enough to use the central limit theorem; assume individual results are independent; assume σ² = s² for both populations.
c 58.5 (3 s.f.)

Challenge
a T = rX̄ + sȲ
E(T) = rμ + sμ = (r + s)μ
So if T is unbiased then r + s = 1
b r + s = 1 ⇒ s = 1 − r
T = rX̄ + (1 − r)Ȳ
Var(T) = r²Var(X̄) + (1 − r)²Var(Ȳ)
= r²σ²/n + (1 − r)²σ²/m = σ²(r²/n + (1 − r)²/m)
c d/dr Var(T) = σ²(2r/n − 2(1 − r)/m)
d/dr Var(T) = 0 ⇒ rm = (1 − r)n ⇒ rm + rn = n ⇒ r = n/(m + n)
d (nX̄ + mȲ)/(m + n)

CHAPTER 6
Prior knowledge check
1 a 264 b 13.8
2 1.690
3 (16.0, 19.0)

Exercise 6A
1 Confidence interval = ((n − 1)s²/χ²ₙ₋₁(α/2), (n − 1)s²/χ²ₙ₋₁(1 − α/2))
= (14s²/26.119, 14s²/5.629)
= (2.573, 11.938)
2 x̄ = 6.62
Confidence interval = ((n − 1)s²/χ²ₙ₋₁(α/2), (n − 1)s²/χ²ₙ₋₁(1 − α/2))
= (19s²/30.144, 19s²/10.117)
= (0.259, 0.772)
3 x̄ = 2.878..., s² = 0.458...
Confidence interval = ((n − 1)s²/χ²ₙ₋₁(α/2), (n − 1)s²/χ²ₙ₋₁(1 − α/2))
= (13 × 0.458.../24.736, 13 × 0.458.../5.009)
= (0.241, 1.191)
4 (0.731, 16.835)
5 a (22.099, 112.45) b Normal distribution
6 (148.137, 1043.704)
7 6972.0
8 (5.04, 8.18)
9 a (0.1684, 0.7625)
b Diameters of fastenings have an underlying normal distribution.
c Lower limit of C.I. > 0.15, therefore the machine needs recalibrating.

Exercise 6B
1 a x̄ = 16.605, s² = (5583.63 − 20(16.605)²)/19 = 3.637...
b H₀: σ² = 1.5, H₁: σ² > 1.5
Critical region ≥ 30.144
Test statistic = (n − 1)s²/σ² = (19 × 3.637...)/1.5 = 46.073
The test statistic is in the critical region so reject H₀.
There is evidence to suggest σ² > 1.5.
2 x̄ = 0.337, s² = 0.00286...
H₀: σ² = 0.09, H₁: σ² < 0.09
Critical region ≤ 2.700
Test statistic = (n − 1)s²/σ² = (9 × 0.00286...)/0.09 = 0.287
The test statistic is in the critical region so reject H₀.
There is evidence to suggest that the variance is less than 0.09.


3 H₀: σ² = 4.1, H₁: σ² ≠ 4.1
x̄ = 5.74, s² = 6.940...
Critical region ≤ 2.700 and ≥ 19.023
Test statistic = (n − 1)s²/σ² = (9 × 6.940...)/4.1 = 15.235
The test statistic is not in the critical region so do not reject H₀.
There is no evidence the variance does not equal 4.1.
4 H₀: σ² = 1.12², H₁: σ² ≠ 1.12²
Critical region ≤ 8.907 and ≥ 32.852
Test statistic = (n − 1)s²/σ² = (19 × 1.15...)/1.12² = 17.419
The test statistic is not in the critical region so do not reject H₀.
There is no evidence the variance does not equal 1.12².
5 a x̄ is an unbiased estimate for μ.
s² is an unbiased estimate for σ².
x̄ = 149.941/15 = 9.996...
s² = (1498.83 − 15x̄²)/14 = 0.0006977...
b H₀: σ² = 0.04, H₁: σ² ≠ 0.04
Critical region ≤ 5.629 and ≥ 26.119
Test statistic = (n − 1)s²/σ² = (14 × 0.0006977)/0.04 = 0.244
The test statistic is in the critical region so reject H₀.
There is evidence that the variance is not 0.04.
6 a s² = 0.06125
b H₀: σ² = 0.19, H₁: σ² ≠ 0.19
Critical region ≤ 2.167 and ≥ 14.067
Test statistic = (n − 1)s²/σ² = (7 × 0.06125)/0.19 = 2.256
The test statistic is not in the critical region so do not reject H₀.
There is no evidence that σ² does not equal 0.19.
7 a H₀: σ² = 110.25, H₁: σ² < 110.25
Critical region ≤ 10.117
Test statistic = (n − 1)s²/σ² = (19 × 8.5²)/110.25 = 12.451
The test statistic is not in the critical region so do not reject H₀.
There is no evidence that the variance has reduced.
b Take a larger sample before committing to the new component.
8 a 3.212, standard error = 0.0875 (3 s.f.)
b H₀: σ = 0.25, H₁: σ ≠ 0.25, critical values 2.700 and 19.023, test statistic = 11.03616. Hence there is not enough evidence to show that the standard deviation is different from 0.25.


1 


Test statistic = = 12.451 
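
The variance tests in this exercise all use the statistic $(n-1)s^2/\sigma_0^2$ compared against a chi-squared critical value. The sketch below is illustrative only: the function name `chi2_variance_statistic` is an assumption introduced here, and the critical value is typed in from tables rather than computed. It reuses the summary statistics quoted for question 1.

```python
def chi2_variance_statistic(n, s2, sigma0_sq):
    """Return (n - 1) * s^2 / sigma_0^2 for a test of H0: sigma^2 = sigma_0^2."""
    return (n - 1) * s2 / sigma0_sq


# Unbiased sample variance from summary statistics:
# s^2 = (sum of x^2 - n * xbar^2) / (n - 1)
n = 20
xbar = 16.605
sum_x2 = 5583.63
s2 = (sum_x2 - n * xbar ** 2) / (n - 1)

stat = chi2_variance_statistic(n, s2, 1.5)

# Upper 5% point of chi-squared with 19 degrees of freedom, read from tables
critical_value = 30.144
reject_h0 = stat > critical_value

print(round(s2, 3), round(stat, 2), reject_h0)
```

For a two-tailed test the same statistic would be compared with both a lower and an upper critical value.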


Exercise 6C 


1a 2.34 h 3:36 € 3.37 
1 1 
oe 2024 hb — sides 6-4 cp tos 
8,6 Fro 95 Fs 
3 4 337 b 4.20 c 6.06 6 
1 1 1 
4a — -00370 b — =0176 ce — =0.101 
Fi23 Fiz 
5 a 3.07, =0.299 b 2.91, = 0.364 
10,8 Se 
c 5.41, F701 
24h ' Ontine ) 


a P(X < 0.5) = Pig i. < 0.5) 
= PFs > 35) 
= P(Fyp49 > 2) 
From the tables F,, 4,(0.05) = 2 
* P(Pizao > 2) = Pl o12 < 0.5) = 
P(X < 3.28) = 1 - P(Fin. > 3.28) 


0.05 


=1-0.05=0.95 
1 \ 
P(X < 555) = P(Fias < 3-95 | 
= P(Fe 1p > 2.85) 
1 
~ Hep) 009 
1 _ _ aie 
Plage < 4-506) P(X < 5.06) P(X < she) 
= 0.95 - 0.05 
= 0.90 
P(X < 9.55) =1- PF), > 9.55) 
=1-0.01 


= 0.99 
a Upper tail: $P(X > 3) = P(F_{6,12} > 3) = 0.05$
So $P(X < 3) = 1 - 0.05 = 0.95$
Other tail:
$P(X < 0.25) = P(F_{6,12} < 0.25) = P(F_{12,6} > 4) = 0.05$
So $P(0.25 < X < 3) = 0.95 - 0.05 = 0.9$
b ${}^6C_2(0.9)^2(0.1)^4 \times 0.9 = 0.00109$
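
The lower-tail probabilities in this exercise rely on the reciprocal property $P(F_{\nu_1,\nu_2} < x) = P(F_{\nu_2,\nu_1} > 1/x)$. As a rough numerical check, the sketch below integrates the F density with Simpson's rule. The helper names `f_pdf` and `f_cdf` are assumptions made for this illustration, and the integration is only intended for $\nu_1 > 2$, where the density is zero at the origin.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom (d1 > 2)."""
    if x <= 0:
        return 0.0
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = ((d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2)
               - log_beta)
    return math.exp(log_pdf)


def f_cdf(x, d1, d2, steps=4000):
    """Approximate P(F < x) with composite Simpson's rule (steps must be even)."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3


lower_tail = f_cdf(0.25, 6, 12)      # P(F_{6,12} < 0.25)
upper_tail = 1 - f_cdf(4.0, 12, 6)   # P(F_{12,6} > 4)
print(round(lower_tail, 3), round(upper_tail, 3))
```

Both probabilities should agree with each other and sit close to the tabulated 5% level, since 4.00 is the tabulated 5% point of $F_{12,6}$.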


Exercise 6D 


1 Critical value is 4.06
$F_{test} = \frac{7.6}{6.4} = 1.1875$
Not in critical region.
Do not reject $H_0$ - there is no evidence to suggest that $\sigma_1^2 \neq \sigma_2^2$
2 Critical value is 2.29
$F_{test} = 2.4706$
In critical region.
Reject $H_0$ - there is evidence to suggest that $\sigma_1^2 > \sigma_2^2$
3 a Critical value is 3.28
$F_{test} = \frac{225}{63} = 3.57$
In critical region.
Reject $H_0$ - there is evidence to suggest that the machines differ in variability.
b Population distributions are assumed to be normal.
4 Critical value is 2.85
$F_{test} = \frac{52.6}{36.4} = 1.44$
Not in critical region.
Do not reject $H_0$ - there is no evidence to suggest that $\sigma_1^2 \neq \sigma_2^2$
5 a $\sigma^2_{Goodstick} = 1.363$
$\sigma^2_{Holdtight} = 0.24167$
Critical value is 5.19
$F_{test} = \frac{1.363}{0.24167} = 5.64$
In critical region.
Reject $H_0$ - there is evidence to suggest that the variances are not equal.
b Holdtight as it is less variable and cheaper.
6 $\sigma^2_{chop} = 22\,143.286$
$\sigma^2_{picabak} = 6570.85238$
Critical value is 2.85
$F_{test} = \frac{22\,143.286}{6570.85238} = 3.37$
In critical region. Reject $H_0$ - there is evidence to suggest that their variances differ.
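
Each variance-ratio test in this exercise divides the larger sample variance by the smaller one and compares the result with a tabulated F critical value. A minimal sketch of that comparison follows, using the figures 7.6 and 6.4 and the critical value 4.06 quoted above. The function name `variance_ratio` is hypothetical, and the critical value is supplied by hand rather than computed.

```python
def variance_ratio(s2_a, s2_b):
    """Return the larger sample variance divided by the smaller one."""
    larger, smaller = max(s2_a, s2_b), min(s2_a, s2_b)
    return larger / smaller


f_test = variance_ratio(6.4, 7.6)

# Critical value read from F tables for the relevant degrees of freedom
critical_value = 4.06
in_critical_region = f_test > critical_value

print(round(f_test, 4), in_critical_region)
```

Putting the larger variance on top means only the upper tail of the F distribution needs to be consulted.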


Full worked solutions are available in SolutionBank. 


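7 a $\mu_1 = 1046$, $s_1^2 = 1818.11$ and $\mu_2 = 997.75$, $s_2^2 = 1200.21$
b Critical value is 3.73
$F_{test} = \frac{1818.111}{1200.21} = 1.5148$
Not in critical region.
Do not reject $H_0$ - there is no evidence to suggest that $\sigma_1^2 \neq \sigma_2^2$
c Given that the variances appear to be equal, use present supplier who appears to have a higher mean.

Mixed exercise 6
1 a Confidence interval $= \left(\frac{(n-1)s^2}{\chi^2_{13}(0.025)}, \frac{(n-1)s^2}{\chi^2_{13}(0.975)}\right) = \left(\frac{13 \times 1.8}{24.736}, \frac{13 \times 1.8}{5.009}\right) = (0.946, 4.672)$
b Confidence interval $= \left(\frac{(n-1)s^2}{\chi^2_{13}(0.05)}, \frac{(n-1)s^2}{\chi^2_{13}(0.95)}\right) = \left(\frac{13 \times 1.8}{22.362}, \frac{13 \times 1.8}{5.892}\right) = (1.046, 3.971)$
2 $\bar{x} = 71.4$, $s^2 = \frac{102\,286 - 20 \times 71.4^2}{19} = 17.2$
a Confidence interval $= \left(\frac{(n-1)s^2}{\chi^2_{19}(0.025)}, \frac{(n-1)s^2}{\chi^2_{19}(0.975)}\right) = \left(\frac{19 \times 17.2}{32.852}, \frac{19 \times 17.2}{8.907}\right) = (9.948, 36.69)$
b $10 = 1.6449\sigma$ so $\sigma = \frac{10}{1.6449} = 6.079$
c $\sqrt{36.69} < 6.079$ so the supervisor should not be concerned.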
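
The variance confidence intervals above divide $(n-1)s^2$ by an upper and a lower chi-squared percentage point. The sketch below reproduces the two intervals for $n = 14$, $s^2 = 1.8$. The function name `variance_ci` is an assumption for illustration, and the percentage points are typed in from tables.

```python
def variance_ci(n, s2, chi2_upper, chi2_lower):
    """Confidence interval for sigma^2 given tabulated chi-squared points.

    chi2_upper is the upper-tail point (e.g. 2.5%), chi2_lower the
    lower-tail point (e.g. 97.5%), both with n - 1 degrees of freedom.
    """
    scaled = (n - 1) * s2
    return scaled / chi2_upper, scaled / chi2_lower


# 95% interval: chi-squared(13) points 24.736 and 5.009
ci_95 = variance_ci(14, 1.8, 24.736, 5.009)
# 90% interval: chi-squared(13) points 22.362 and 5.892
ci_90 = variance_ci(14, 1.8, 22.362, 5.892)

print(tuple(round(v, 3) for v in ci_95))
print(tuple(round(v, 3) for v in ci_90))
```

Taking square roots of both limits gives the corresponding interval for the standard deviation.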


7 $P(F_{5,10} > 3.33) = 0.05 \Rightarrow b = 3.33$
$P(F_{10,5} > 4.74) = 0.05 \Rightarrow P\left(F_{5,10} < \frac{1}{4.74}\right) = 0.05$
so $a = 0.2110$ (4 s.f.)
8 a $H_0: \sigma_1^2 = \sigma_2^2$; $H_1: \sigma_1^2 \neq \sigma_2^2$
$\frac{s_1^2}{s_2^2} = 3.0625$
C.V.: $F_{12,7} = 3.57$
Since 3.0625 is not in the critical region, there is insufficient evidence to reject $H_0$. There is insufficient evidence of a difference in the variances of the lengths of the fence posts.
b The lengths of the fence posts are normally distributed.
9 Hy : or? = om? H, : or? 4 on? 
si? = (17 956.5 ~ 7 x 50.63) = 33.8 — 5.66333... 
su? = § (28 335.1 - 10 x 53.24) = a = 3.63333... 
Si = 1.5587... 
SM 
Foy = 3.37 
Not in critical region. There is no reason to doubt the 
variances of the two distributions are the same. 
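Challenge
$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$
$\mathrm{Var}(\chi^2_{n-1}) = \mathrm{Var}\left(\frac{(n-1)S^2}{\sigma^2}\right) = 2(n-1)$
Take out factor from variance and square:
$\frac{(n-1)^2}{\sigma^4}\mathrm{Var}(S^2) = 2(n-1)$
$\Rightarrow \mathrm{Var}(S^2) = \frac{2\sigma^4}{n-1}$
Variance decreases as $n$ increases.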


3 a A confidence interval for a population parameter is a range of values defined so that there is a specific probability that the true value of the parameter lies within that range.
b (2.98, 5.14)
4 $H_0: \sigma = 2.7$, $H_1: \sigma \neq 2.7$
Critical regions are $\frac{(n-1)s^2}{\sigma^2} > 14.449$ and $\frac{(n-1)s^2}{\sigma^2} < 1.237$
Test statistic = 6.58
6.58 is not in the critical region, so not enough evidence to show that the standard deviation is different from 2.7.
5 a (3.266, 8.669)
b Times taken have an underlying normal distribution.
c Lower limit of C.I. > 3.1 therefore the dosage needs changing.
6 $\bar{x} = 45.1$  $s = 6.838...$
$H_0: \sigma = 5$  $H_1: \sigma \neq 5$
Critical region $> 19.023$ and $< 2.700$
Test statistic $= \frac{(n-1)s^2}{\sigma^2} = \frac{9 \times 6.838...^2}{5^2} = 16.836$
There is insufficient evidence to reject $H_0$, so there is no evidence that the variance has altered.


CHAPTER 7


Prior knowledge check 
1 [15.12, 16.88] 
2 z=1.953... so no evidence (z < 1.96). 


Exercise 7A 
1 a $P(X > t) = 0.025$ when $t = 2.179$ so $P(X < t) = 0.025$ when $t = -2.179$
b $P(X > t) = 0.05$ when $t = 1.782$
c $P(X > t) = 0.025$ when $t = 2.179$ so $P(|X| < t) = 0.95$ when $|t| = 2.179$
2 a 2.479 b 1.706
3 a $P(Y > t) = 0.05$ when $t = 1.812$ so $P(Y < t) = 0.95$ when $t = 1.812$
b $P(Y > t) = 0.005$ when $t = 2.738$ so $P(Y < t) = 0.005$ when $t = -2.738$
c $P(Y > t) = 0.025$ when $t = 2.571$ so $P(Y < t) = 0.025$ when $t = -2.571$
d $P(Y > t) = 0.01$ when $t = 2.583$, and $P(Y < t) = 0.01$ when $t = -2.583$, so $P(|Y| < t) = 0.98$ when $|t| = 2.583$
e $P(Y > t) = 0.05$ when $t = 1.734$ and $P(Y < t) = 0.05$ when $t = -1.734$, so $P(|Y| > t) = 0.10$ when $|t| = 1.734$




4 


10 
$\bar{x} = 20.95$  $s = 3.4719...$  $n = 8$  $\nu = 7$
confidence limits $= \bar{x} \pm t_{n-1}\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt{n}} = 20.95 \pm 1.895 \times \frac{3.4719...}{\sqrt{8}}$
$= 18.624$ and $23.276$
Confidence interval = (18.624, 23.276)
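
The confidence limits $\bar{x} \pm t \times s/\sqrt{n}$ can be checked with a few lines of code. This is a sketch only: `mean_ci` is a hypothetical helper, and the value 1.895 is assumed to be the tabulated 5% point of the t-distribution with 7 degrees of freedom, which reproduces the limits quoted above.

```python
import math


def mean_ci(xbar, s, n, t_crit):
    """Confidence interval for the mean when the population variance is unknown."""
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# t-value with n - 1 = 7 degrees of freedom, read from tables
lower, upper = mean_ci(20.95, 3.4719, 8, 1.895)
print(round(lower, 3), round(upper, 3))
```

Only the tabulated t-value changes when a different confidence level is required.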


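5 $\bar{x} = 12.4$  $s = \sqrt{21}$  $n = 16$  $\nu = 15$
confidence limits $= \bar{x} \pm t_{n-1}\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt{n}} = 12.4 \pm 2.131 \times \frac{\sqrt{21}}{\sqrt{16}}$
$= 9.9586...$ and $14.8413...$
Confidence interval = (9.959, 14.841)
6 a $\bar{x} = 179.333333$  $s = 5.5015...$  $n = 6$  $\nu = 5$
confidence limits $= \bar{x} \pm t_{n-1}\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt{n}} = 179.333 \pm 2.015 \times \frac{5.5015...}{\sqrt{6}}$
$= 174.808$ and $183.859$
Confidence interval = (174.808, 183.859)
b confidence limits $= \bar{x} \pm t_{n-1}\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt{n}} = 179.333 \pm 2.571 \times \frac{5.5015...}{\sqrt{6}}$
$= 173.559$ and $185.108$
Confidence interval = (173.559, 185.108)
7 a $\bar{x} = 10.36$  $s = 0.73363...$  $n = 10$  $\nu = 9$
confidence limits $= \bar{x} \pm t_{n-1}\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt{n}} = 10.36 \pm 2.821 \times \frac{0.73363...}{\sqrt{10}}$
$= 9.706$ and $11.014$
Confidence interval = (9.706, 11.014)
b Masses are normally distributed.
8 $\bar{x} = \frac{224.1}{8} = 28.0125$
$s^2 = \frac{1}{7}\left(6337.39 - \frac{224.1^2}{8}\right) = 8.54125$
$s = 2.92254...$  $n = 8$  $\nu = 7$
confidence limits $= \bar{x} \pm t_{n-1}\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt{n}} = 28.0125 \pm 3.499 \times \frac{2.92254...}{\sqrt{8}}$
$= 24.397$ and $31.628$
Confidence interval = (24.397, 31.628)
9 $\bar{x} = 122$  $s = \sqrt{225}$  $n = 26$  $\nu = 25$
confidence limits $= \bar{x} \pm t_{n-1}\left(\frac{\alpha}{2}\right) \times \frac{s}{\sqrt{n}} = 122 \pm 2.060 \times \frac{\sqrt{225}}{\sqrt{26}}$
$= 115.940$ and $128.060$
Confidence interval = (115.94, 128.06)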


Normal | \? 


For the population mean, using 
a sample of size 50 from a 
population of unknown variance 


For the population mean, 
using a sample of size 6 from a v 
population of known variance 


For the population variance, 


using a sample of size 20 sf 


7 a X=1.085 s? 


11 a Population variance is unknown 


b (468.7, 509.7) 
c The lifespan is normally distributed 
d (24.19, 112.2) 


Exercise 7B 
1 $\bar{x} = 11.4$  $s = 1.816...$
$H_0: \mu = 11$  $H_1: \mu > 11$
Critical region $t > 2.132$
Test statistic $t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{11.4 - 11.0}{1.816.../\sqrt{5}} = 0.492$
The result is not in the critical region.
No evidence that $\mu$ is greater than 11.
2 $\bar{x} = 17.1$  $s = 2$
$H_0: \mu = 19$  $H_1: \mu < 19$
Critical region $t < -2.473$
Test statistic $t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{17.1 - 19}{2/\sqrt{28}} = -5.027$
The result is in the critical region.
There is evidence that $\mu < 19$.
3 $\bar{x} = 3.26$  $s = 0.8$
$H_0: \mu = 3$  $H_1: \mu \neq 3$
Critical values $\pm 2.179$
Critical region $t < -2.179$ or $t > 2.179$
Test statistic $t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{3.26 - 3}{0.8/\sqrt{13}} = 1.172$
The result is not in the critical region.
There is no evidence that $\mu$ is not 3.
4 a Population variance is unknown.
b $\bar{x} = 98.2$  $s = 15.744...$
$H_0: \mu = 100$  $H_1: \mu \neq 100$
Critical region $t < -2.145$ or $t > 2.145$
Test statistic $t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{98.2 - 100}{15.74.../\sqrt{15}} = -0.443$
The result is not in the critical region.
There is no evidence that $\mu$ is not 100.
5 $\bar{x} = 1048.75$  $s = 95.2346...$
$H_0: \mu = 1000$  $H_1: \mu > 1000$
Critical region $t > 1.895$
Test statistic $t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{1048.75 - 1000}{95.2346.../\sqrt{8}} = 1.448$
The result is not in the critical region.
There is no evidence that $\mu$ is greater than 1000.
6 $\bar{x} = 6.4857...$  $s^2 = 0.853626...$  $s = 0.923919...$
$H_0: \mu = 6$  $H_1: \mu > 6$
Critical region $t > 2.160$
Test statistic $t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{6.4857142 - 6}{0.923919.../\sqrt{14}} = 1.967$
The result is not in the critical region.
There is no evidence supporting manufacturer's claim.
7 a $\bar{x} = 1.085$  $s^2 = \frac{28.4 - 20 \times 1.085^2}{19} = 0.2555...$
$s = 0.5055...$
$H_0: \mu = 1.00$  $H_1: \mu < 1.00$
Critical region $t < -1.328$
Test statistic $t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{1.085 - 1}{0.5055.../\sqrt{20}} = 0.752$
The result is not in the critical region.
There is no evidence that $\mu$ is less than 1.00.
b Amounts of radiation are normally distributed.

8 a $H_0: \mu = 100$; $H_1: \mu > 100$. Test statistic is 2.98..., compared with the critical value of 1.729, therefore reject $H_0$ - there is evidence that the training improves IQ.

b $H_0: \sigma = 12$, $H_1: \sigma \neq 12$, C.I. is (11.91, 20.56) so, since the stated value is in the C.I., do not reject $H_0$.
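
Every test in this exercise uses the one-sample statistic $t = (\bar{x} - \mu)/(s/\sqrt{n})$. A minimal sketch of the calculation for question 1 is given below. The function name `one_sample_t` is an assumption for illustration, $s$ is entered as the rounded value 1.816, and the critical value is typed in from tables.

```python
import math


def one_sample_t(xbar, mu0, s, n):
    """t statistic for H0: mu = mu0 when the population variance is unknown."""
    return (xbar - mu0) / (s / math.sqrt(n))


t_stat = one_sample_t(11.4, 11.0, 1.816, 5)

# One-tailed 5% critical value for 4 degrees of freedom, read from tables
critical_value = 2.132
reject_h0 = t_stat > critical_value

print(round(t_stat, 2), reject_h0)
```

Because $s$ is rounded here, the third decimal place may differ slightly from the value in the answer.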


Exercise 7C 
1 a i $H_0: \mu_D = 0$  $H_1: \mu_D \neq 0$
ii $H_0: \mu_D = 0$  $H_1: \mu_D > 0$
b $\sum d = 30$  $\sum d^2 = 238$
$\bar{d} = 5$
$s^2 = \frac{238 - 6 \times 5^2}{5} = 17.6$
$s = 4.195$
Critical value $t_5(5\%) = 2.015$
The critical region is $t > 2.015$
$t = \frac{5 - 0}{4.195/\sqrt{6}} = 2.919$
In the critical region, reject $H_0$.
There is evidence to suggest that there has been an increase in shorthand speed.
2 $H_0: \mu_D = 0$  $H_1: \mu_D > 0$
$\sum d = 5$  $\sum d^2 = 59$
$\bar{d} = 0.5$
$s^2 = \frac{59 - 10 \times 0.5^2}{9} = 6.2778$
$s = 2.50555$
Critical value $t_9(1\%) = 2.821$
The critical region is $t > 2.821$
$t = \frac{0.5 - 0}{2.50555/\sqrt{10}} = 0.631$
Not in the critical region. Do not reject $H_0$.
There is no evidence to suggest that paper 2 is easier than paper 1.
3 a $H_0: \mu_D = 0$  $H_1: \mu_D > 0$
$\sum d = 47$  $\sum d^2 = 315$
$\bar{d} = 4.7$
$s^2 = \frac{315 - 10 \times 4.7^2}{9} = 10.456$
$s = 3.234$
Critical value $t_9(5\%) = 1.833$
The critical region is $t > 1.833$
$t = \frac{4.7 - 0}{3.234/\sqrt{10}} = 4.596$
In the critical region. Reject $H_0$.
There is evidence to suggest that chewing the gum does reduce the craving for cigarettes.
b The differences are normally distributed.
4 $H_0: \mu_D = 0$  $H_1: \mu_D > 0$
$\sum d = 46$  $\sum d^2 = 336$
$\bar{d} = 4.6$
$s^2 = \frac{336 - 10 \times 4.6^2}{9} = 13.8222$
$s = 3.7178$
Critical value $t_9(5\%) = 1.833$
The critical region is $t > 1.833$
$t = \frac{4.6 - 0}{3.7178/\sqrt{10}} = 3.913$
In the critical region. Reject $H_0$. There is evidence to suggest that the journey times have decreased.
5 a $H_0: \mu_D = 0$  $H_1: \mu_D \neq 0$
$\sum d = 20$  $\sum d^2 = 604$
$\bar{d} = 2.5$
$s^2 = \frac{604 - 8 \times 2.5^2}{7} = 79.1429$
$s = 8.896$
Critical value $t_7(5\%) = 1.895$
The critical regions are $t < -1.895$ and $t > 1.895$
$t = \frac{2.5 - 0}{8.896/\sqrt{8}} = 0.795$
Not in the critical region. Do not reject $H_0$.
The mock examination is a good predictor.
b The differences are normally distributed.
6 a Different people will have different productivity rates. A common link is needed to compare before and after. This reduces experimental error due to differences between individuals so that, if a difference does exist, it is more likely to be detected.
b $H_0: \mu_D = 0$  $H_1: \mu_D > 0$
$\sum d = 65$  $\sum d^2 = 569$
$\bar{d} = 6.5$
$s^2 = \frac{569 - 10 \times 6.5^2}{9} = 16.278$
$s = 4.0346$
Critical value $t_9(5\%) = 1.833$
The critical region is $t > 1.833$
$t = \frac{6.5 - 0}{4.0346/\sqrt{10}} = 5.095$
In the critical region. Reject $H_0$.
There is evidence to suggest a tea break increases the number of garments made.
7 $H_0: \mu_D = 0$  $H_1: \mu_D > 0$
$\sum d = 8.6$  $\sum d^2 = 20.78$
$\bar{d} = 1.075$
$s^2 = \frac{20.78 - 8 \times 1.075^2}{7} = 1.6479$
$s = 1.2837$
Critical value $t_7(1\%) = 2.998$
The critical region is $t > 2.998$
$t = \frac{1.075 - 0}{1.2837/\sqrt{8}} = 2.3686$
Not in the critical region. Do not reject $H_0$.
There is no evidence to suggest that the drug increases the mean number of hours sleep per night.
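
The paired t-tests in this exercise all start from the totals $\sum d$ and $\sum d^2$ of the differences. The sketch below shows one way to go from those totals to the test statistic, using the figures $\sum d = 30$, $\sum d^2 = 238$, $n = 6$ from question 1. The function name `paired_t_from_sums` is hypothetical and the critical value is read from tables.

```python
import math


def paired_t_from_sums(sum_d, sum_d2, n, mu0=0.0):
    """Return (dbar, s, t) for a paired t-test from the sums of differences."""
    dbar = sum_d / n
    s2 = (sum_d2 - n * dbar ** 2) / (n - 1)   # unbiased variance of the differences
    s = math.sqrt(s2)
    t = (dbar - mu0) / (s / math.sqrt(n))
    return dbar, s, t


dbar, s, t_stat = paired_t_from_sums(30, 238, 6)

# One-tailed 5% critical value for 5 degrees of freedom, read from tables
critical_value = 2.015
reject_h0 = t_stat > critical_value

print(dbar, round(s, 3), round(t_stat, 3), reject_h0)
```

The same helper applies to the other questions by changing the three inputs and the tabulated critical value.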


Exercise 7D 
1 a $s_p^2 = \frac{(9 \times 4) + (14 \times 5.3)}{10 + 15 - 2} = 4.7913$
$s_p = 2.189$
$t_{23}(2.5\%) = 2.069$
$(25 - 22) \pm 2.069 \times 2.189\sqrt{\frac{1}{10} + \frac{1}{15}} = 3 \pm 1.849 = (1.151, 4.849)$
b Independent random samples, normal distributions, common variance
2 a $\bar{x}_1 = 8.9125$  $s_1^2 = 0.58125$
$\bar{x}_2 = 12.04$  $s_2^2 = 0.84933$
$s_p^2 = \frac{(7 \times 0.58125) + (9 \times 0.849)}{8 + 10 - 2} = 0.7319$
$s_p = 0.8555$
$t_{16}(5\%) = 1.746$
$(12.04 - 8.9125) \pm 1.746 \times 0.8555\sqrt{\frac{1}{8} + \frac{1}{10}} = 3.1275 \pm 0.7081 = (2.419, 3.836)$
b Population variances are equal - reasonable since the compost is designed to increase the amount of growth, not the variability.
3 a $s_p^2 = \frac{(19 \times 6.12) + (19 \times 5.22)}{20 + 20 - 2} = 5.67$
$s_p = 2.381...$
$t_{38}(0.5\%) = 2.712$
$(38.2 - 32.7) \pm 2.712 \times 2.381...\sqrt{\frac{1}{20} + \frac{1}{20}} = 5.5 \pm 2.042 = (3.46, 7.54)$
b normality and equal variances
c zero not in interval $\Rightarrow$ method B seems better than method A
4 Assume same variances,
$s_p^2 = \frac{(9 \times 32.488) + (9 \times 33.344)}{10 + 10 - 2} = 32.916$
$s_p = 5.73725$
$t_{18}(5\%) = 1.734$
$(18.6 - 14.3) \pm 1.734 \times 5.73725\sqrt{\frac{1}{10} + \frac{1}{10}} = 4.3 \pm 4.44905 = (-0.149, 8.749)$
5 a $H_0: \sigma_A^2 = \sigma_B^2$, $H_1: \sigma_A^2 \neq \sigma_B^2$, C.V. = 3.87, test stat = 1.33, therefore no evidence that there is a difference in variability.
Assumption: samples are taken from normally distributed populations.
b Can assume population variances are equal - this is one of the fundamental requirements to use the t-distribution.
c (0.752, 3.85)
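
The intervals in this exercise combine a pooled variance estimate with a tabulated t-value. The following sketch reproduces the calculation for question 1 (sample sizes 10 and 15, variances 4 and 5.3, means 25 and 22). The helper names `pooled_variance` and `difference_ci` are assumptions for illustration, and 2.069 is the tabulated value quoted in the answer.

```python
import math


def pooled_variance(n1, s1_sq, n2, s2_sq):
    """Pooled estimate of a common variance from two independent samples."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)


def difference_ci(xbar1, xbar2, n1, n2, sp_sq, t_crit):
    """Confidence interval for mu1 - mu2 assuming equal population variances."""
    half_width = t_crit * math.sqrt(sp_sq) * math.sqrt(1 / n1 + 1 / n2)
    diff = xbar1 - xbar2
    return diff - half_width, diff + half_width


sp_sq = pooled_variance(10, 4, 15, 5.3)
lower, upper = difference_ci(25, 22, 10, 15, sp_sq, 2.069)
print(round(sp_sq, 4), round(lower, 3), round(upper, 3))
```

If the interval excludes zero, that points to a difference between the population means at the chosen level.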


Exercise 7E 
1 a $s_p^2 = \frac{(19 \times 12) + (10 \times 12)}{20 + 11 - 2} = 12$
b $H_0: \mu_1 = \mu_2$  $H_1: \mu_1 \neq \mu_2$
critical value $t_{29}(0.025) = 2.045$
critical region is $t < -2.045$ and $t > 2.045$
$t = \frac{(16 - 14) - 0}{3.464\sqrt{\frac{1}{20} + \frac{1}{11}}} = 1.538$
Not in critical region - do not reject $H_0$.
There is no evidence to suggest that the populations have different means.
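
The two-sample tests in this exercise use the pooled statistic $t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{s_p\sqrt{1/n_1 + 1/n_2}}$. A short sketch for question 1 follows (sample sizes 20 and 11, both sample variances 12, means 16 and 14). The function name `pooled_t` is hypothetical and the critical value is taken from tables.

```python
import math


def pooled_t(xbar1, xbar2, n1, s1_sq, n2, s2_sq, diff0=0.0):
    """Two-sample t statistic assuming equal population variances."""
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    standard_error = math.sqrt(sp_sq) * math.sqrt(1 / n1 + 1 / n2)
    return (xbar1 - xbar2 - diff0) / standard_error


t_stat = pooled_t(16, 14, 20, 12, 11, 12)

# Two-tailed 5% test with 29 degrees of freedom: tabulated value 2.045
critical_value = 2.045
reject_h0 = abs(t_stat) > critical_value

print(round(t_stat, 3), reject_h0)
```

The `diff0` argument allows hypotheses of the form $\mu_1 - \mu_2 = k$ as well as equality of means.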
2 Ho! Mp = Me Ay: lp > Me 
X, = 38.67, Sy? = 5.5827 
x.= 41.625, s?=1.5625 


Sp 


» _ (5 x 5.5827) + (3 x 1.5625) 


= 4.07 
6+4-2 ae 


critical value ¢,(0.05) = 1.86 
critical region is t = 1.860 


(41.625 — 38.67) - 0 


= 2.270 
2.0187 (+1 


In the critical region — reject Ho. 
There is evidence to suggest that the salmon are wild. 


3a 


x,=0.1185 = s?=0.0005227 

x,=0.1425 — s.2=0.0011319 

s2e (5 x 0.0005227) + (5 x 0.0011319) 

6+6-2 

Hot Hi = Me Hi: fn A Me 

critical value ¢, (0.025) = 2.228 

critical region $t < -2.228$ and $t > 2.228$

i (0.1425 - 0.1185) - 0 
0.02876 2 + % 

Not in the critical region — accept Hp. 

There is evidence to suggest that Tetracycline and 

Erythromycin are equally as effective. 

X,=0.2387 — s°=0.0004959 

X,=0.1305 — s.? = 0.000909 

ss (11 x 0.000909) + (5 x 0.0004959) 

. 12+6-2 

Ho: Ms =H2 Hy: Ws # Me 

critical value ¢,,(0.05) = 1.746 

critical region ¢ > 1.746 

1 = (0.2387 - 0.1305) - 0 _ 
0.02795 +3 

In the critical region — reject Ho. 


There is evidence to suggest that Streptomycin is 
more effective than the others. 


= 0.000827 


= 1.445 


= 0.000780 


7.75 


Ho > Hola = Hnew H, > Hola = new 
Xoa = 7.911 Soa? = 5.206 
Xnow = 9.9 — Snow” = 3.98 
s2= (6 x 3.98) + (8 x 5.206) 
i 7+9-2 
critical value ¢,, = 1.761 
critical region t > 1.761 
(7.911 - 5.9) -0 
2.16355 + 
Significant — there is evidence to suggest that new 
language does improve time. 


Once task is solved the programmer should be 
quicker next time with either language. 


27 x 34-384 =534= Sox 


= 4.6806 


Test statistic ¢ = = 1.8446 


Pa > a 
27-1 2727-1) 
Sy? = 31810 


31810 — 12480 = 19330 = Sox? 


%,=32 s2=17.45 

%,=35.6  s% = 22.829 

ot = 04 x 22.829) + (11 x 17-45) _ 99 464 
12+15-2 




c Ho : fy = Hs Hy: by A Ms 
critical value ¢,;(0.025)= 2.060 
critical region $t < -2.060$ and $t > 2.060$
(35.6 — 32) -0 


~ ine 
4.524/244 
— no evidence to suggest difference in means 

d normality 


same types of driving, roads and weather. 


= 2.0547 - accept Hy 


fa>) 


6 a Ho:04? = 09”, Hy: o,? 4 op”, C.V. = 2.91, test stat = 1.57, 


therefore no evidence that there is a difference in 
variability. Assumption: samples are taken from 
normally distributed populations. 

b Can assume population variances are equal — this 
is one of the fundamental requirements to use two- 
sample t-test. 


C Ho: fa = bp Hy: pra A fp, C.V. = 2.074, test stat = 6.177, 


therefore evidence that there is a difference in the 
average weight of potatoes. 


Challenge 
E[(n,, — 1) s?2 + (n, — 1) sil =E[(n, - 1) s2] + E[(n, - 1) si 
= (n,— 1) s? + (n, - 1) 8? 
= [(n, — 1) + (n, - 1)] s? 
.- Us? -1)s2 
50 E(S?) = E 2 )s2 + (ny — 1)s? 
(n, — 1) + (ny - 1) 
El(n, — sz + (n, - 87] 
[(n, — 1) + (n, - I] 
[(n, — 1) + (ny - 1)lo? 
(7, — 1) + (my - 1)] 
=O. 
So Sgis an unbiased estimator 


Mixed exercise 7 
1 H):=28 H,:p4 28 
Critical region <—2.160 or > 2.160 
— = 4 28 _ 1.4967 


Test statistic = . 3 


we v4 
The test statistic is not in the critical region. 
There is no evidence to suggest that 4 does not = 28 
2 Hjo:f=10 H,:p>10 
Critical region > 1.895 


= 85 
t= "e = 10.625 

2 aae = 
ae = : = _ 970.25 = 10.625? _ 9 599 | 


Test statistic =~" = 10.625 — 10 


Se 
i a8)... 


= 0.571 — not critical — no evidence to 
suggest that . > 10 
3:2 2:52.833... SiH 1722: 
a Confidence interval 


oF il§) roa (9) 8) 
7 1.722... 
= (52.833... - 2.571 x LIER | 52.833... 


1.722... 
ie") 


= (51.025, 54.641) 


+ 2.571 x 




b_ Confidence interval = 


(n-1)s* (n- 1)s? 
SG) e-3) 


7 (2 X72 2.53% “5x }peees| 


12.832 ’ 0.831 
= (1.156, 17.850) 
c They are normally distributed. 


4 a Confidence interval 


=F tes($) = SF ta (§) >) 


0.7 0.7 
=|9.8 - 2.110 x ——, 9.8 + 2.110 x —— 
| v18 7B 


= (9.451, 10.148) 


b Confidence interval = 


(8) MEG 


- (44 x 0.49 17x | 
30.191 * 7.564 


= (0.276, 1.101) 


(n-1)s?. (n - 1)s? | 


5 x=20.95 s= 2.674... 


Hy: =21.5 Hy: < 21.5 
critical region < -1.895, 
Test statisncr= = 205521... 
S_ 2.674... 
ue 8 
The test statistic is not in the critical region. 
There is no evidence to reject claim. 


= -0.5817. 


6 x=6.1916.... s=0.7549... s? = 0.5699... 


a Confidence interval 
fe... a\y SF a), S. 
= (5 ta($) x Gao + bea($) x 2) 
0.7549... 
=(6.1916... — 2.201 x —~==~, 6.1916... 
| wane 


0.7549... 
$2,201 x 07549... 
* Vi2 


= (5.712, 6.671) 


b Confidence interval Var. = 


(n-1)s? (n- 1)s? 
EG) 0-3) 
= (4 x 0.5699... 11 x aed 
21.920 : 3.816 
= (0.286, 1.643) 
Confidence interval s.d. = (0.535, 1.282) 


c He should measure his blood glucose at the same 
time each day. 


7xH115 gs =2:073... 
a Confidence interval 


= a 3. a aoe 
= (3 taea($) x qe E+ bea) x Fe) 
2.073... 2.073... 
=(115=2571 (11.5 42.571 x 2073. 
| 7 v6 7 x v6 
= (9.324, 13.675) 


b Confidence interval = 


ale) Gal ~ c 
(2 *21073..:2 5 X ae 
> 12.832 ’ 0.831 
= (1.675, 25.872) 
8a H:o=4 H:o>4 
Critical region x? > 16.919 
(n- 1)s’ _ 9 x 5.2? 
a wie 15.21 
The test statistic is not in the critical region. 
standard deviation = 4 months 


(n-1)s?_ (n - 1)s? | 


Test statistic = 
b $H_0: \mu = 24$  $H_1: \mu > 24$
Critical region ¢ > 1.833 


Test statistic ¢ = eae et 


2- 
—s 5.2 
va v10 
There is evidence to reject Hy 
The mean battery life >24 
c Lifetime is normally distributed. 
9 %=721.5 s=10.399.... 


a Confidence interval 
= (Eto) 3+ te(9) =) 
= [721.5 — 2.093 x 10-399... 


v20 
10.399... 
721.5 + 2.093 x 72 
. * 20 
= (717, 726) 
b Confidence interval variance 
(n-1)s* (n- 1)s? 


2 (Q@\* 2 =20. 
Xna(Z) Xna(1- 9) 
2 ROS eS 
~\"32.852 «8.907 


= (62.553, 230.717) 
Confidence interval standard deviation = 
(7.909, 15.189) 
c 725 within confidence interval, 
There is no evidence to reject this hypothesis. 


10 a x= 342-342 


10 
2 nye 
ee = - = _ 121.6 - ty eSAD oe eee. 
b i Confidence interval mean 


= (Ft §] SF 3) 5) 


0.7177... 
= [3.42 - 2.262 x ——=—,, 
| ‘. v10 


0.7177... 
3.42 + 2.262 x =H ff 
70 


= (2.906, 3.933) 
ii Confidence interval variance 
(n-1)s? (n—- 1)s? 
x2(0.025)’ x2(0.975) 


= (2 x 0.515... 9x etal 
19.023 ° 2.700 
= (0.244, 1.717) 
Confidence interval standard deviation = 
(0.4937, 1.3103) 

c 3.5 hours is inside the confidence interval on the 
mean, so there is no evidence of a change in the 
mean time. 

0.5 hours is inside the confidence interval on the 
standard deviation so there is no evidence of a 
change in the variability of the time. 

There is no reason to change the repair method. 

d Use a ‘matched pairs’ experiment, getting each 
engineer to carry out a similar repair using the old 
method and the new method and use a paired t-test. 


11 a Normal approximation (n is large), C.I. = (23.69, 
24.31); 
b Sample size is small 
c Ho: w = 25, Hy: pp > 25, c.v. = 2.015, 


test stat = 2.981, hence evidence that male raccoons 
have an average length greater than 25 cm. 
12


13 


14 


15 


16 


d: 5, 13, -8, 2, -3, 4, 11, -1 
(Sod = 23, cd? = 409) d = 2.875, sd = 6.9987 (~7.00) 
Ho: 4g=90 Hy: pg >O 
(2.875 — 0) 
~ "6.9987 
v8 
Critical value t;(10%) = 1.415 (one-tailed) 
Critical region is ¢ > 1.415 
Not significant 
Insufficient evidence to support the chemist’s claim 


= 1.1618... (1.16) 


The data were not collected in pairs. 
Use data from twin lambs. 

Age, weight, gender 

d=B-A 

d: 2,1.2,1, 1.8, -1, 2.2, 2,-1.2, 1.1, 2.8 
Sod = 11.9; Sod? = 30.01 


v. @ =1.19; s? = 1.761 (s = 1.327) 
H,:d=0 H,:d40 Allow py ford 


t= 119-0 _ 5 93574. 
L761 


10 
w= 9; C.V.: 6 = 2.262 
Critical regions ¢ < —2.262 or t > 2.262 
Since 2.8357... is in the critical region (¢ = 2.262) 
there is evidence to reject Hy. The (mean) weight 
gained by the lambs is different for each diet. 
e Diet B —it has the higher mean. 
a d: 14, 2, 18, 25, 0, -8, 4, 4, 12, 20 
(Sod = 91; Sod? = 1789) 
- @=91 s=V106.7 
No: #a=9O Hy: pa FO 
_ (9.1 - 0) 
~ 10.332 
v10 
Critical value t) = +1.833 
critical regions ¢ < -1.833 or t = 1.833 
Significant. There is a difference between blood 
pressure measured by arm cuff and finger monitor. 
b_ The difference in measurements of blood pressure 
is normally distributed. 
Differences: 2.1, -0.7, 2.6, -1.7, 3.3, 1.6, 1.7, 1.2, 1.6, 
2.4 
Yid=14.1 Sod?=40.65 d=1.41 
Ho: ta@=O Hy: py > 0 
_ | 40.65 - 10 x 1.412? 


aoc re 


10.332... 


= 2.785 


s = 1.5191... 
14.” 
: (us reget 
v10 


(1%) = 2.821 

so critical region t > 2.821 

2.935... > 2.821 Evidence to reject Ho. 

There has been an increase in the mean weight of the 
mice. 


16 a $s_o^2 = \frac{5136.3}{9} - \frac{225^2}{10(10 - 1)} = 8.2$
$s_n^2 = \frac{6200}{8} - \frac{234^2}{9(9 - 1)} = 14.5$
b $H_0: \sigma_o^2 = \sigma_n^2$  $H_1: \sigma_o^2 < \sigma_n^2$
critical value is $F_{8,9} = 3.23$
so critical region, $F > 3.23$
$F_{test} = \frac{14.5}{8.2} = 1.768$
not in critical region
do not reject $H_0$ - there is no evidence to suggest that the variances differ
c $s_p^2 = \frac{(9 \times 8.2) + (8 \times 14.5)}{10 + 9 - 2} = 11.1647$
$H_0: \mu_o = \mu_n$  $H_1: \mu_o \neq \mu_n$
critical value $t_{17}(0.01) = 2.567$
critical region $t < -2.567$ and $t > 2.567$
$t = \frac{(26 - 22.5) - 0}{\sqrt{11.1647}\sqrt{\frac{1}{10} + \frac{1}{9}}} = 2.2798$
Not in the critical region - do not reject $H_0$.
There is no evidence of a difference in mean times between the old and new equipment.
d $t_{17}(2.5\%) = 2.110$
$(26 - 22.5) \pm 2.110 \times \sqrt{11.1647} \times \sqrt{\frac{1}{10} + \frac{1}{9}} = 3.5 \pm 3.2394 = (0.261, 6.739)$
e Need to learn new equipment
f Gather data on new equipment after it has been mastered.

17 a $H_0: \sigma_A^2 = \sigma_B^2$, $H_1: \sigma_A^2 \neq \sigma_B^2$, C.V. = 2.15, test stat = 1.53, therefore no evidence that there is a difference in variability. Assumption: samples are taken from normally distributed populations.
b Can assume population variances are equal - this is one of the fundamental requirements to use the t-distribution.
c (0.186, 2.01)


Challenge 
> (n,- 1S? + (n, - IS) + (n,- 1S? 
a.5,= 
e n, +N, +n, -3 
(n, — 1)S? + (n, - 1S? + (n, - 1S? 
b E(S3)=E u 


n, +n, +n,—3 
E(n, - 1)S? + (n, — DS? + (m, - 1)S? 
Nn, +N, +n, - 3 
((n, - 1) + (n, = 1) + (n, - Do? 


n, +n, +n,—3 
1 fon 
So S$} is an unbiased estimator. 


Review exercise 2 
1 Hy: w=18; Hy: p< 18. 
-1.9364... < -1.6449 so reject Ho. There is evidence 
that the (mean) time to complete the puzzles has 


reduced. 
2a ¥=4.52,82=1.51(3 sf) 
b Ho: fa = oes Fy: fa > bp. 


1.868... > 1.6449 so reject Hy. There is evidence 
that diet A is better than diet B or evidence that 
(mean) weight loss in the first week using diet A is 
more than with diet B. 

c The central limit theorem enables you to assume 
that A and B are both normally distributed since 
both samples are large. 

d Assumed o4? = S42 and op? = Sp? 


10 


11 


12 


13 


14 


15 


16 


(12 
a 


o 


seme omn o 


see 


ao 2 


b 
Fro, 


a= 




7,151) to3 sf. 

xX = 50, s? = 0.193 (3 s.f.) 

(49.7, 50.3) 

(49.6, 50.4) 

* =110.5, s* = 672 (3 sf.) 

(95.0, 126) to 3.s.f. 

0.4633 (4 d.p.) 

X = 168, s* = 27.0 (3 sf.) 

(166, 170) to 3 s.f. 

Ho: Me = bs Fy: be A Ba 

2.860... > 1.96 so reject Hy. There is evidence of a 
difference in the (mean) amount spent on junk food 
by male and female teenagers. 

The central limit theorem enables you to assume 
that F and M are both normally distributed since 
both samples are large. 

X = 287, s* = 7682.5 

Sample size (=) 97 required. 

(3.28, 11.8) to 3.s.f. 

a =3s80 07 = 9. Since 9 lies in the interval, the 
confidence interval supports the assertion. 
(0.00164, 0.00719) to 3 s.f. 

0.07? = 0.0049 

0.0049 is within the 95% confidence interval. There 
is no evidence to reject the idea that the standard 
deviation of the volumes is not 0.07, or the machine 
is working well. 

Hjo:0 =4; H,:0 > 4. 

v=19, x?, (0.05) = 30.144 


(n-1)S?_ 19 x 6.25 
a 4 

Since 29.6875 < 30.144 there is insufficient 

evidence to reject Hy. There is insufficient evidence 

to suggest that the standard deviation is greater 

than 2. 

Times are normally distributed. 

y2 (5%) = 2.75 so b = 2.75 


= 0.344 


= 29.6875 


a 
2.91 


P(X > 2.85) = 0.05 


at 


P(x Z =| =0.01 


So 


a 


of 


5.67 
1 


Pe <X< 2.85) =1-0.05 - 0.01 = 0.94 
5.67 


Hy: 042 = oR’; H, 142 x op’. 
Critical value F,, 5; = 1.96 
2 
22 210 
S42 
Since 2.10 is in the critical region, we reject Hy and 
conclude there is evidence that the two variances 
are different. 
The lengths of pebbles are normally distributed. 
(10.4, 449) to3s.f. 
Ho: 0n2 = 052; Hy: 0? > as? 
Su* _ 318.8 
— = = 9.859... 
S52 32.3 
F63(0.01) = 27.91 
9.859... < 27.91 so do not reject Ho. There is 
insufficient evidence of an increase in variance. 
(1.13, 2.23) to 3 sf. 
(1.09, 3.46) to 3 s.f. 


2.5 - 


c Require P(X > 2.5) = oz > to be as small 


2.5 - 
a 


as possible OR to be as large as possible; 


so both both o and yu must be as small as possible. 
Lowest estimate for proportion with high 
cholesterol 


_ °(z 5 2.5 -1.13 
v1.09 


= 1- 0.9049 = 0.0951 = 9.51% 
17 a (3.44, 4.58) to 3.s.f. 
b (0.302, 2.13) to 3 sf. 
7~- 
c Require P(X > 7) = (z > lies to be as high as 


possible; therefore both o and y must be as large as 
possible. 

Highest estimate for proportion with high blood 
glucose 


=?(2> 


) = P(Z > 1.31) 


7- At) 

v2.13 
= 1-0.9515 = 0.0485 = 4.85% 
i (119, 127) to 3 s.f. 
ii (16.3, 115) to 3 sf. 
b 130 is just above the confidence interval. 16 is just 
below the confidence interval. Thus the supervisor 
should be concerned about the speed to the new 
typist since both their average speed is too slow and 
their variability of time is too large. 
Hj:07 =0.9 H,:0? 40.9 
v=19 
CR (Lower tail 10.117) 

Upper tail 30.144 


Test statistic = Wags = 31.6666, significant. 


18 a 


19 a 


There is sufficient evidence that the variance of the 
length of spring is different to 0.9. 

b Ho:u=100 H,:4 > 100 
ty) = 1.328 is the critical value 


_ 100.6- 100 _9 49 
1.5 
20 
Significant. The mean length of spring is greater 


than 100. 
- 66002) _ 
84? = 7h(3.960 540 - 2600") _ 54.0 


t 


20 a 


1 9815? 
es 3(7 410579 — ae = 21.16 
Ho: 042 = op?, Hy: 04? 4 op? 
Critical region: F192 > 2.75 
$4? 54.0 


= 220 = 2.55118... 
Sp? 21.16 


Since 2.55118... is not in the critical region, we can 
assume that the variances are equal. 
b Ho: p= 4 t+ 150; Hy: pg > y+ 150 
C.R.: t2:(0.05) > 1.717 
oe 10 x 54.0 + 12 x 21.16 _ 36.0909 
. 22 
755 - 600 - 150 
Taal 

36.0909...(44 + 7) 


= 2.03157 


Since 2.03... is in the critical region we reject Ho 
and conclude that the mean weight of cauliflowers 
from B exceeds that from A by at least 50g. 


252 


21 


22 


23 


c Samples from normal populations 
Equal variances 
Independent samples 


io] 


c 
d 


is? = 32.49 Sy? = 12.25 
Ho: 0¢? = oy’, Hy :0¢2 > oy? 
Critical value: Fg = 3.23 
Sc? __ 32.49 
Sy2 12.25 
2.652... < 3.23 so do not reject H) and assume 
the variances are equal. 

ti Ho:¥¢=Xy, Hy:¥e> Fy 
v=17 
Critical value ¢,;(0.05) = 1.740 
ape (9 x 3.57) + (8 x 5.7%) 

17 
82.3 - 78.2 


(21.774...(5 +4) 


1.912... > 1.74 so reject Ho, there is evidence the 
mean scores differ. 
Independent and samples from normal population. 
(12.4, 47.6) 
Hyioy? S057, Hyia,? 4 a;7 
$,2= 22.5 > 8,2 = 21.6 


= 2.652... 


= 21.774... 


=1,912... 


2 
“a= 1.04 
Sp 
Fig = 4.15 


1.04 < 4.15 do not reject Hy. The variances are the 
same. 
Assume the samples are selected at random 
(independent) 
2 8(22.5) + 6(21.62) 
eo 14 
Ho: Ha = He Hy: Ma # Mp 
+ — 40.667 - 39.57 _ 9 ago 
je2.a2{f + 
Critical value = ¢,,(2.5%) = 2.145 
0.462 < 2.145 No evidence to reject Ho. 
The means are the same. 
Music has no effect on performance. 
H,:0,? 4 0,2 
8° _ 35.79 
aa 2.4716... 
Not significant so do not reject Hp. 
Insufficient evidence to suspect o,? 4 0,7 
Ho! Ha = He» Hythe A Me 
g2 = £% 35.79 + 12 x 14.48 
18 
_ 32.31 - 28.43 
S\ig+7 
t1(5%)> tai, CV. = 2.101 
~. Not significant. 
Insufficient evidence of difference in mean 
performance. 
Test in part b requires 0, = 0,” 
e.g.same: type of driving 
roads and journey length 
weather 
driver 


22.12 


. 2 = 2 
Ho: op” = oF 


F642(5%); jai, C-V. = 3.00, 


= 21.583 


t = 1.78146... 
24


25 


26 


27 


28 


a X%=668.125 s = 84.425 
t,(5%) = 1.895 


_ 1.895 x 84.425 
- v8 
= 611.6 and 724.7 
Confidence interval = (612, 725) 
Normal distribution 
c £650 is within the confidence interval. 
No need to worry. 
Hy:#=1012 Hy: 41012 
13 700 
14 (= 978.57...) 
ste 13 448 750 - 14x? 
“ 13 


X-" _ 978.6 - 1012 
tis =" 57.06 =-2.19... 
me v4 
t,,(5%) two-tail critical value = -2.160 
Significant result - there is evidence of a change in 
mean weight of squirrels. 


Let x represent weight of flour 
Sox = 8055». = 1006.875 
yee 8110611 «se 1{s 110611 - S082 
= 33.26785... 
2.8 = 5.767825 
Hy: = 1010; Hy: < 1010 
Critical value: ¢ = -1.895 so critical region ¢ < -1.895 
(1006.875 - 1010) 
~ (57878) ~ 
v8 


Since -1.53 is not in the critical region there is 
insufficient evidence to reject Ho. 

The mean weight of flour delivered by the machine is 
1010. 


9, 2-8 1,1, 8 165 
Sod = 19; rd? = 193 


ae =9 975: 5? = 


Confidence limits = 668.125 4 


L= 


(= 3255.49) 


-1.5324 


1 - 1¥} - 
A193 8 = 21.125 
Ho:4p=9; Hy: up > 0 


p= 23 =0 _ 4 4615. 


v=7 = critical region: ¢ > 1.895 

Since 1.4195... is not in the critical region there is 
insufficient evidence to reject H, and we conclude that 
there is insufficient evidence to support the doctors’ 
belief. 


a d=coursework — written: 4, -3, -3, 4, 6, 3, -4, 17, 
7,7 
_ _ 2 : 
a= 38 9.8 52-498 = 100" _ 39 98 
10 
test statistic: t = AS = 1.917... 
v10 


Ho:a=0, Hy: pg > 0 
to(5%) c.v. is 1.833 
. Significant - there is evidence coursework marks 
are higher. 
b The difference between the marks follows a normal 
distribution. 
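The paired t-test on the coursework - written differences above reduces to a one-sample test on the differences. A minimal Python sketch of that calculation follows; it is illustrative only, the helper name `paired_t` is hypothetical, and the critical value 1.833 is copied from the answer rather than computed.

```python
import math
import statistics


def paired_t(differences):
    """Return (mean, unbiased variance, t statistic) for H0: mean difference = 0."""
    n = len(differences)
    d_bar = statistics.mean(differences)
    s2 = statistics.variance(differences)  # divisor n - 1
    t = d_bar / math.sqrt(s2 / n)
    return d_bar, s2, t


# Differences d = coursework - written, as listed in the answer.
d = [4, -3, -3, 4, 6, 3, -4, 17, 7, 7]
d_bar, s2, t_stat = paired_t(d)

critical_value = 1.833  # one-tailed 5% value with 9 degrees of freedom, from the answer
reject_h0 = t_stat > critical_value
print(d_bar, s2, t_stat, reject_h0)
```

Working from the list of differences rather than from the two original columns mirrors the method in the answer: only the differences need to be assumed normally distributed.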


29 D = dry - wet
$H_0: \mu_D = 0$, $H_1: \mu_D \neq 0$
d: 0.6, -1, -1.9, -1.4, -1.3, 0.5, -1.6, -0.6, -1.8
$\bar{d} = \dfrac{-8.5}{9} = -0.94$, $s^2 = \dfrac{15.03 - 9 \times \bar{d}^2}{8} = 0.87527...$
$t = \dfrac{-0.94...}{\sqrt{0.87527...}/\sqrt{9}}$ = awrt -3.03
$t_8$ two-tailed 1% critical value = 3.355
Not significant - insufficient evidence of a difference
between mean strength.


30 a $H_0: \mu_D = 0$, $H_1: \mu_D > 0$ where d = without solar heating - with solar
heating
d = 6, -3, 7, -2, -8, 6, 5, 11, 5
$\bar{d} = 3$
$s_d = 6$
$n = 9$
t test statistic $= \dfrac{3 - 0}{6/\sqrt{9}}$
t.s. = 1.5
Critical value = $t_8$(5%) = 1.860
so critical region: t > 1.860
Test statistic not in critical region so accept $H_0$.
Conclude there is insufficient evidence that solar
heating reduces mean weekly fuel consumption.
b The differences are normally distributed.

31 a $s_p^2 = \dfrac{7 s_A^2 + 7 s_B^2}{14} = 5.92$
$s_p = 2.433105$
$H_0: \mu_A = \mu_B$, $H_1: \mu_A > \mu_B$
$t = \dfrac{26.125 - 25}{2.43\sqrt{\frac{1}{8} + \frac{1}{8}}}$
t = 0.92474
$t_{14}$(2.5%) = 2.145
Insufficient evidence to reject $H_0$.
Conclude that there is no difference in the means.
b d = 2, 5, -2, 1, 3, -4, 1, 3
$\bar{d} = \dfrac{9}{8} = 1.125$
$s_d^2 = \dfrac{69 - 8 \times 1.125^2}{7} = 8.410714$
$t_7$(2.5%) = 2.365
There is no significant evidence of a difference
between method A and method B.
c Paired sample as they are two measurements on
the same orange.

32 a Confidence interval is given by
$\bar{x} \pm t_{19} \times \dfrac{s}{\sqrt{n}}$
i.e.: $207.1 \pm 2.539 \times \sqrt{\dfrac{3.2}{20}}$
i.e.: $207.1 \pm 1.0156$
i.e.: (206.08..., 208.1156)
b $s_2^2 = \dfrac{418\,785.4 - \frac{2046.2^2}{10}}{9}$
$s_2^2 = 10.2173$
$\bar{x}_2 = \dfrac{2046.2}{10} = 204.62$
$s_p^2 = \dfrac{19 \times 3.2 + 9 \times 10.2173}{28} = 5.45557...$
Confidence interval is given by
$\bar{x}_1 - \bar{x}_2 \pm t_{28} \times \sqrt{5.45557\left(\dfrac{1}{20} + \dfrac{1}{10}\right)}$
i.e.: $(207.1 - 204.62) \pm 1.701\sqrt{5.45557\left(\dfrac{1}{20} + \dfrac{1}{10}\right)}$
i.e.: $2.48 \pm 1.53875$
i.e.: (0.94125, 4.0187)
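The interval for the difference between two means above combines a pooled variance with a t critical value. The sketch below reproduces that arithmetic in Python under stated assumptions: the function name is illustrative, the sample sizes 20 and 10 and the variances 3.2 and 10.2173 are read from the working, and the critical value 1.701 is taken from the answer rather than computed.

```python
import math


def diff_means_interval(mean1, var1, n1, mean2, var2, n2, t_crit):
    """Pooled-variance confidence interval for mu1 - mu2."""
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    half_width = t_crit * math.sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return sp2, (diff - half_width, diff + half_width)


# Values as they appear in the working; t_crit is the tabulated value for 28 degrees of freedom.
sp2, (low, high) = diff_means_interval(207.1, 3.2, 20, 204.62, 10.2173, 10, 1.701)
print(sp2, low, high)
```

Because the interval lies wholly above zero, it is consistent with the first mean being larger than the second.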


Challenge 
Ta EGGX: ~ 2X2 + GXs) = Gu gut GH 
$E(Y) = \mu$ ⟹ unbiased
b $E(aX_1 + bX_2) = a\mu + b\mu = \mu$
⟹ a + b = 1
$\mathrm{Var}(aX_1 + bX_2) = a^2\sigma^2 + b^2\sigma^2$
$= a^2\sigma^2 + (1 - a)^2\sigma^2 = (2a^2 - 2a + 1)\sigma^2$
c Minimum value when $(4a - 2)\sigma^2 = 0$
⟹ 4a - 2 = 0
$a = \frac{1}{2}$, $b = \frac{1}{2}$
2 a Consider cumulative distribution functions:
X has c.d.f. $F(x) = x$ in the interval $0 \le x \le 1$.
In order for $M \le x$, you must have no more than 1
of the $X_i > x$.
So either all three $X_i \le x$ with probability $F(x)^3 = x^3$.
Or exactly one of the $X_i > x$, with probability
$3 \times F(x)^2 \times (1 - F(x)) = 3x^2(1 - x)$ since it can happen
in 3 different ways.
So $P(M \le x) = x^3 + 3x^2(1 - x) = 3x^2 - 2x^3$
So M has c.d.f.
$F(x) = \begin{cases} 0 & x < 0 \\ 3x^2 - 2x^3 & 0 \le x \le 1 \\ 1 & x > 1 \end{cases}$
Differentiating, M has p.d.f.
$f(x) = \begin{cases} 6x - 6x^2 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$
b $E(M) = \int_0^1 x(6x - 6x^2)\,dx = \left[2x^3 - \frac{3}{2}x^4\right]_0^1 = 0.5 = E(X)$
c $E(M^2) = \int_0^1 x^2(6x - 6x^2)\,dx = \left[\frac{3}{2}x^4 - \frac{6}{5}x^5\right]_0^1 = 0.3$
$\mathrm{Var}(M) = E(M^2) - (E(M))^2 = 0.3 - 0.5^2 = 0.05$
Standard error of $M = \sqrt{0.05} = 0.224$ (3 s.f.)
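The mean and variance of M found above can be cross-checked in two ways: exactly, by integrating the p.d.f. 6x - 6x^2 term by term, and approximately, by simulation. The Python sketch below does both. It assumes, as the working suggests, that M is the median of three independent values that are uniform on [0, 1]; the function name, the seed and the number of trials are arbitrary choices.

```python
import random
import statistics
from fractions import Fraction


def moment_of_median(k):
    """Exact E(M^k): the integral of x^k (6x - 6x^2) over [0, 1]."""
    return Fraction(6, k + 2) - Fraction(6, k + 3)


mean_m = moment_of_median(1)
var_m = moment_of_median(2) - mean_m ** 2

# Simulation: median of three independent uniform(0, 1) values, repeated many times.
random.seed(0)
medians = [
    statistics.median(random.random() for _ in range(3))
    for _ in range(100_000)
]
sim_mean = statistics.fmean(medians)
sim_var = statistics.pvariance(medians)
print(mean_m, var_m, sim_mean, sim_var)
```

The exact values are 1/2 and 1/20, and the simulated values should land close to them, which supports the standard error of roughly 0.224.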


Exam-style practice: AS level 
1 a [Sketch of f(x) against x]
Mode is 1
$\int_1^2 (2 - kx^2)\,dx = \left[2x - \dfrac{k}{3}x^3\right]_1^2 = 1 \Rightarrow k = \dfrac{3}{7}$
$\dfrac{39}{28}$ (= 1.393)
$F(x) = \begin{cases} 0 & x < 1 \\ 2x - \frac{1}{7}x^3 - \frac{13}{7} & 1 \le x \le 2 \\ 1 & x > 2 \end{cases}$
F(m) = 0.5
⟹ $2m - \frac{1}{7}m^3 - \frac{13}{7} = 0.5$
⟹ $2m^3 - 28m + 33 = 0$
Mean > Median > Mode therefore positive skew.
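The ordering of mean, median and mode used in the skewness argument above can be verified numerically. The sketch below is illustrative only and assumes the density is f(x) = 2 - (3/7)x^2 on 1 <= x <= 2, which is the form consistent with the cubic 2m^3 - 28m + 33 = 0. It finds the median by bisection and the mean by exact integration; the helper names are hypothetical.

```python
from fractions import Fraction


def cubic(m):
    """Left-hand side of 2m^3 - 28m + 33 = 0, which the median satisfies."""
    return 2 * m ** 3 - 28 * m + 33


def bisect(f, lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi], assuming f changes sign on the interval."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def x_times_pdf_antiderivative(x):
    """Antiderivative of x * (2 - (3/7) x^2), namely x^2 - (3/28) x^4."""
    return x ** 2 - Fraction(3, 28) * x ** 4


median = bisect(cubic, 1.0, 2.0)
mean = x_times_pdf_antiderivative(2) - x_times_pdf_antiderivative(1)
mode = 1  # the density is decreasing on [1, 2], so it peaks at x = 1
print(float(mean), median, mode)
```

The mean exceeds the median, which exceeds the mode, matching the conclusion of positive skew.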


$f(x) = \begin{cases} \frac{1}{40} & 0 \le x \le 40 \\ 0 & \text{otherwise} \end{cases}$

0.455 (3 s.f.) 

$H_0$: There is no association between French and
Spanish scores. $H_1$: There is an association between
French and Spanish scores. Critical value is 0.6485
(two-tailed test) and 0.455 < 0.6485 so accept
the null hypothesis - there is no evidence of an
association between French and Spanish scores.
$H_0: \rho = 0$; $H_1: \rho > 0$; critical value is 0.5494
(one-tailed test) and 0.568 > 0.5494 so reject the
null hypothesis - there is evidence of a (positive)
correlation between French and Spanish scores.
Spearman’s rank does not use the actual data, just
the ranks.


RSS $= S_{yy} - \dfrac{(S_{xy})^2}{S_{xx}}$
$S_{xy} = 561.3 - \dfrac{88 \times 32.3}{5} = -7.18$
RSS $= 0.852 - \dfrac{(-7.18)^2}{S_{xx}} = 0.0613$ (3 s.f.)
Thus since 0.0524 < 0.0613, we conclude that ice-
cream sales are more likely to have a linear model.


b 


Exam-style practice: A level 
1 a 0.665 


0.312 


2 a Let $\sigma_f^2$ be variance with fertiliser;
$H_0: \sigma_f^2 = \sigma_n^2$; $H_1: \sigma_f^2 \neq \sigma_n^2$
$F_{12,9}$ critical value = 3.07
So the test is not significant - the variances can be
assumed equal.
b Let $\mu_f$ be mean with fertiliser; $H_0: \mu_f = \mu_n$; $H_1: \mu_f \neq \mu_n$
$s_p^2 = \dfrac{9 \times 5.29^2 + 12 \times 6.84^2}{10 + 13 - 2} = 38.72...$
So $t = \dfrac{\bar{x}_f - \bar{x}_n}{\sqrt{38.72... \times \left(\frac{1}{10} + \frac{1}{13}\right)}} = 1.298...$
$t_{21}$ 5% two-tail c.v. is 2.080.
Test is not significant - there is insufficient evidence
of a difference in mean heights. No evidence to
support the idea that the fertiliser increases the
average height gained.
The test in part b requires that both the variances
are equal. The test in part a established that this
was reasonable.


95% C.I. uses t value of 2.571;
$\dfrac{\hat{\sigma}}{\sqrt{6}} \times 2.571 = \dfrac{1}{2}(223.5 - 206.2) \Rightarrow \hat{\sigma} = 8.241...$
$\Rightarrow \hat{\sigma}^2 = 67.9$ (3 s.f.)
(5.14, 20.2)
Let S = span of an adult male’s hand
$P(S > 230) = P\left(Z > \dfrac{230 - \mu}{\sigma}\right)$
For a maximum value, you need largest $\mu$ and
largest $\sigma$
$P\left(Z > \dfrac{230 - 223.5}{20.2}\right) = P(Z > 0.3218)$
= 1 - 0.626
= 0.374
So highest estimate of the proportion is 0.37 (2 d.p.)
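The two numerical steps in this answer, recovering the standard deviation estimate from the width of the interval and then bounding the proportion above 230, can be reproduced with the standard library. The sketch below is a hedged illustration: the variable names are made up, n = 6 is inferred from the square root in the working, and the limits 223.5 and 20.2 are the upper ends of the two intervals quoted in the answer.

```python
import math
from statistics import NormalDist

# Interval for the mean quoted in the working, with its t critical value.
lower, upper = 206.2, 223.5
t_crit = 2.571
n = 6  # assumed sample size, giving 5 degrees of freedom

# The half-width equals t_crit * sigma_hat / sqrt(n); solve for sigma_hat.
half_width = (upper - lower) / 2
sigma_hat = half_width * math.sqrt(n) / t_crit
variance_hat = sigma_hat ** 2

# Largest plausible proportion above 230: use the largest mean and largest sigma.
largest_mu = upper
largest_sigma = 20.2
p_max = 1 - NormalDist(mu=largest_mu, sigma=largest_sigma).cdf(230)
print(sigma_hat, variance_hat, p_max)
```

`NormalDist.cdf` standardises internally, so no separate z value is needed; the result agrees with the tabulated 1 - 0.626.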


$S_{hh} = 272\,094 - \dfrac{1562^2}{9} = 1000.2$
$S_{cc} = 2\,878\,966 - \dfrac{5088^2}{9} = 2550$
$S_{hc} = 884\,484 - \dfrac{1562 \times 5088}{9} = 1433.3$
$b = \dfrac{1433.3}{1000.2} = 1.433015$
$a = \dfrac{5088}{9} - b \times \dfrac{1562}{9} = 316.6256$
c = 317 + 1.43h
For every 1 cm increase in height, the confidence
measure increases by 1.43.
a is the value of c when h = 0 which makes no
sense since this implies a person could be 0 cm tall.
i -3.97, 2.33, -5.41, 6.62, -8.66, 5.01, -5.23,
-5.11, 15.76
ii 573 since residual is far greater than the others.
c = 297 + 1.534h
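The regression coefficients in this answer come from summary statistics alone. The Python sketch below shows that calculation; it is illustrative rather than definitive. The totals 1562 for the heights and 5088 for the confidence measure, and n = 9, are assumptions inferred from the printed values of the S terms and the nine residuals, and the function name is hypothetical.

```python
def regression_from_sums(n, sum_h, sum_c, sum_hh, sum_hc):
    """Least squares line c = a + b*h from summary statistics."""
    s_hh = sum_hh - sum_h ** 2 / n
    s_hc = sum_hc - sum_h * sum_c / n
    b = s_hc / s_hh
    a = sum_c / n - b * sum_h / n
    return s_hh, s_hc, a, b


# sum_hh = 272094 and sum_hc = 884484 appear in the working; the other inputs are assumed.
s_hh, s_hc, a, b = regression_from_sums(9, 1562, 5088, 272094, 884484)
print(s_hh, s_hc, a, b)
```

Rounding a and b to three significant figures gives the line c = 317 + 1.43h quoted in the answer.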

561 

0.810 (3 s.f.) 

$H_0: \rho = 0$, $H_1: \rho > 0$. Critical value is 0.6429 < 0.810,
so at the 5% significance level there is evidence to
suggest a positive correlation between qualifying
lap-times and race results.
Race ranks are not measurable on a continuous scale.
Data will have 4 values with tied rank. Assign a
rank equal to the mean of the tied ranks. Calculate
the PMCC directly from the ranked data rather than
using the formula.

and 

1 since f(x) is increasing function for 0 < x < 1. 

Mode > mean therefore negative skew 

Substitute 3k and k into F(x) and subtract before 

simplifying to get the required result. 
bias 114
biased estimators 114 
bivariate normal distribution 22 


census 110 
central limit theorem 120, 121, 127, 
132-3 
chi-squared distribution 142, 147 
coding 8-9 
linear 22 
combinations of random 
variables 92-8 
conditional probability 72 
confidence intervals (C.I.) 120-4 
difference between means from 
normal distributions 182-3 
interpreting 122 
mean of normal distribution with 
unknown variance 167-8 
variance of normal 
distribution 143 
width 123 
confidence limits 121 
difference between means from 
normal distributions 182-3 
variance of normal 
distribution 143 
continuous data, analysis 22 
continuous distributions 44-91 
see also continuous uniform 
distribution 
continuous random variables 45-8 
mean 56-9 
mean of function of 57, 59 
median 64 
mode 63, 67 
percentiles 64-5
quartiles 64-5 
variance 56-9 
variance of function of 57, 59 
continuous uniform 
distribution 71-6 
cumulative distribution 
function 75 
mean 74-6 
modelling with 79-80 
variance 74-6 
correlation 21-43 
hypothesis testing for zero 
correlation 33-4 
product moment correlation 
coefficient (PMCC) 13, 22-4, 27, 
33-5 
for sample 33 
Spearman’s rank correlation 
coefficient 26-30, 34-5 
for whole population 33 
critical region 33, 34
cumulative distribution functions 
(c.d.f.) 51-3 
continuous uniform distribution 
75 
notation 51 


discrete random variables 45 
distributions, bivariate normal 22 


errors 
modelling for 80 
standard 115-16 
estimates 113 
estimators 113-16 
expected value see mean 
extrapolation 2, 3 


F-distribution 149-54 
critical values 151-3, 156 
order of parameters 150 
paired t-test 174-7

F-test 149, 155-7 


hypothesis testing 

for difference between means 
127-30, 185-7 

for mean of normal distribution 
with unknown variance 170-3 

for variance of normal 
distribution 146-7 

for zero correlation 33-5 


interpolation 2, 3 


least squares regression lines 2-5, 9 

linear correlation 22 

linear fit, checking reasonableness 
10-12 

linear regression 1-20

lower critical value 152-3 


mean 

combinations of random variables 
93-5 

continuous random variable 56-9 

continuous uniform distribution 
74-6 

difference between means of two 
normal distributions 180-3 

function of continuous random 
variable 57, 59 

hypothesis testing for difference 
between means 127-30, 185-7 

of normal distribution with 
unknown variance 164-8, 170-3 

pooled estimate 131 

population 110, 113 

sample 110-13, 115, 120 


median, continuous random 
variable 64 
mode 
continuous random variable 63, 67 
population 113 
sample 111-13 
modelling, with continuous uniform 
distribution 79-80 


negative exponential distribution 90 
negative skew 66 
normal distribution 
bivariate 22 
combination of normally distributed 
random variables 93-5
difference between means of two 
180-3 
hypothesis testing for difference 
between means of two 127-30, 
185-7 
hypothesis testing for mean with 
unknown variance 170-3 
hypothesis testing for variance 
146-7 
mean with unknown variance 
164-8 
probability density function 45 
standardised 120-1, 123 
variance 142-4, 146-7, 155-9 


one-tailed tests 33, 35 
outliers 12 


paired t-test 174-7 
percentiles, continuous random 
variable 64-5 
pooled estimate of population 
mean 131 
pooled estimate of variance 180, 194 
population mean 110, 113 
population mode 113 
population parameters 110 
population variance 114, 132-3 
unbiased estimator 142-4 
positive skew 66, 67 
probability density functions 
(p.d.f.) 45-8 
normal distribution 45 
notation 51 
symmetric 58 
probability mass function 45 
product moment correlation 
coefficient (PMCC) 13, 22-4, 27, 
33-5 
critical values 33 


quartiles, continuous random 
variable 64-5 


random samples 110 
random variables 
combinations of 92-8 
continuous see continuous random 
variables 
discrete 45 
independent 93 
rankings 27 
tied ranks 28, 29 
regression lines 1-5, 8-9
residual plot 11 
residual sum of squares (RSS) 2, 13 
residuals 2, 10-13 
sum of 10, 11 


sample mean 110-13, 115, 120 

sample mode 111-13 

sample variance 114, 132 

sampling distribution 111 

skewness 66-7 

Spearman’s rank correlation 
coefficient 26-30, 34-5 


critical values 34 

standard deviation see variance 

standard error 115-16 

standardised normal 
distribution 120-1, 123 

statistic, definition 110 

Student’s t-distribution see
t-distribution

summary statistics 2 


t-distribution 164-8, 170-3, 181, 
185-7 
critical values 165 
tied ranks 28, 29 
t,, value 167 
two-tailed tests 33, 34, 129-30, 133, 
147 


unbiased estimates 113 
unbiased estimators 113, 114 
unskewed distribution 66 
upper critical value 152-3 


variance 


combinations of random variables 
93-5 

continuous random variable 56-9 

continuous uniform distribution 
74-6 

function of continuous random 
variable 57, 59 

normal distribution 142-4, 146-7, 
155-9 

pooled estimate 180, 194 

population 114, 132-3 

sample 114, 132 

unbiased estimator of population 
142-4 


y on x regression line 2 
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